Skip to content

opendataloader-project/langchain-opendataloader-pdf

Repository files navigation

langchain-opendataloader-pdf

LangChain document loader for OpenDataLoader PDF — parse PDFs into structured Document objects for RAG pipelines.

PyPI version License

Features

  • Accurate reading order — XY-Cut++ algorithm handles multi-column layouts correctly
  • Table extraction — Preserves table structure in output
  • Multiple formats — Text, Markdown, JSON, HTML
  • 100% local — No cloud APIs, your documents never leave your machine
  • Fast — Rule-based extraction, no GPU required

Requirements

  • Python >= 3.10
  • Java 11+ available on system PATH

Installation

pip install -U langchain-opendataloader-pdf

Quick Start

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

# Load a PDF as text
loader = OpenDataLoaderPDFLoader(
    file_path="document.pdf",
    format="text"
)
documents = loader.load()

print(documents[0].page_content)

Usage Examples

Basic Usage

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

# Single file
loader = OpenDataLoaderPDFLoader(file_path="report.pdf")
docs = loader.load()

# Multiple files
loader = OpenDataLoaderPDFLoader(
    file_path=["report1.pdf", "report2.pdf", "documents/"]
)
docs = loader.load()

Output Formats

# Plain text (default) - best for simple RAG
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="text")

# Markdown - preserves headings, lists, tables
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="markdown")

# JSON - structured data with bounding boxes
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="json")

# HTML - styled output
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="html")

Tagged PDF Support

For accessible PDFs with structure tags (common in government/legal documents):

loader = OpenDataLoaderPDFLoader(
    file_path="accessible_document.pdf",
    use_struct_tree=True  # Use native PDF structure
)

Table Detection

For documents with complex tables:

loader = OpenDataLoaderPDFLoader(
    file_path="financial_report.pdf",
    format="markdown",
    table_method="cluster"  # Better for borderless tables
)

Password-Protected PDFs

loader = OpenDataLoaderPDFLoader(
    file_path="encrypted.pdf",
    password="secret123"
)

Image Handling

# Images are excluded by default (image_output="off")
# This is optimal for text-based RAG pipelines

# Embed images as Base64 (for multimodal RAG)
loader = OpenDataLoaderPDFLoader(
    file_path="doc.pdf",
    format="markdown",
    image_output="embedded",
    image_format="jpeg"  # or "png"
)

Suppress Logging

loader = OpenDataLoaderPDFLoader(
    file_path="doc.pdf",
    quiet=True  # No console output
)

RAG Pipeline Example

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Load PDF
loader = OpenDataLoaderPDFLoader(
    file_path="knowledge_base.pdf",
    format="markdown",
    quiet=True
)
documents = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_documents(documents)

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)

# Query
results = vectorstore.similarity_search("What is the main topic?")

Parameters Reference

Parameter Type Default Description
file_path str | List[str] (Required) PDF file path(s) or directories
format str "text" Output format: "text", "markdown", "json", "html"
split_pages bool True Split into separate Documents per page
quiet bool False Suppress console logging
password str None Password for encrypted PDFs
use_struct_tree bool False Use PDF structure tree (tagged PDFs)
table_method str "default" "default" (border-based) or "cluster" (border + clustering)
reading_order str "xycut" "xycut" or "off"
keep_line_breaks bool False Preserve original line breaks
image_output str "off" "off", "embedded" (Base64), or "external"
image_format str "png" "png" or "jpeg"
content_safety_off List[str] None Disable safety filters: "hidden-text", "off-page", "tiny", "hidden-ocg", "all"
replace_invalid_chars str None Replacement for invalid characters

Document Metadata

Each returned Document includes metadata:

doc.metadata
# {'source': 'document.pdf', 'format': 'text', 'page': 1}

License

MIT License. See LICENSE for details.

Links

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 3

  •  
  •  
  •  

Languages