LangChain document loader for OpenDataLoader PDF — parse PDFs into structured Document objects for RAG pipelines.
- Accurate reading order — XY-Cut++ algorithm handles multi-column layouts correctly
- Table extraction — Preserves table structure in output
- Multiple formats — Text, Markdown, JSON, HTML
- 100% local — No cloud APIs, your documents never leave your machine
- Fast — Rule-based extraction, no GPU required
- Python >= 3.10
- Java 11+ available on the system PATH
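The Java requirement above can be checked before installing the loader. The following is a minimal sketch using only the Python standard library; the helper names `find_java` and `java_version_banner` are hypothetical and not part of langchain-opendataloader-pdf. It only reports what `java -version` prints and does not parse the version number, so confirming that the runtime is 11 or newer is left to the reader.

```python
import shutil
import subprocess


def find_java():
    """Return the path to the `java` executable on PATH, or None if absent."""
    return shutil.which("java")


def java_version_banner(java_path):
    """Return the first line printed by `java -version`, or None on failure."""
    try:
        result = subprocess.run(
            [java_path, "-version"],
            capture_output=True,
            text=True,
            timeout=10,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Executable missing, not runnable, or it hung.
        return None
    # `java -version` conventionally writes its banner to stderr.
    output = (result.stderr or result.stdout).strip()
    return output.splitlines()[0] if output else None


java_path = find_java()
if java_path is None:
    print("java not found on PATH; install Java 11+ before using the loader")
else:
    print(f"java found at {java_path}: {java_version_banner(java_path)}")
```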
pip install -U langchain-opendataloader-pdf

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

# Load a PDF as text
loader = OpenDataLoaderPDFLoader(
    file_path="document.pdf",
    format="text"
)
documents = loader.load()
print(documents[0].page_content)

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

# Single file
loader = OpenDataLoaderPDFLoader(file_path="report.pdf")
docs = loader.load()
# Multiple files
loader = OpenDataLoaderPDFLoader(
    file_path=["report1.pdf", "report2.pdf", "documents/"]
)
docs = loader.load()

# Plain text (default) - best for simple RAG
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="text")
# Markdown - preserves headings, lists, tables
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="markdown")
# JSON - structured data with bounding boxes
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="json")
# HTML - styled output
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="html")

For accessible PDFs with structure tags (common in government/legal documents):

loader = OpenDataLoaderPDFLoader(
    file_path="accessible_document.pdf",
    use_struct_tree=True  # Use native PDF structure
)

For documents with complex tables:

loader = OpenDataLoaderPDFLoader(
    file_path="financial_report.pdf",
    format="markdown",
    table_method="cluster"  # Better for borderless tables
)

# Password-protected PDF
loader = OpenDataLoaderPDFLoader(
    file_path="encrypted.pdf",
    password="secret123"
)

# Images are excluded by default (image_output="off")
# This is optimal for text-based RAG pipelines
# Embed images as Base64 (for multimodal RAG)
loader = OpenDataLoaderPDFLoader(
    file_path="doc.pdf",
    format="markdown",
    image_output="embedded",
    image_format="jpeg"  # or "png"
)

loader = OpenDataLoaderPDFLoader(
    file_path="doc.pdf",
    quiet=True  # No console output
)

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
# Load PDF
loader = OpenDataLoaderPDFLoader(
    file_path="knowledge_base.pdf",
    format="markdown",
    quiet=True
)
documents = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_documents(documents)
# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)
# Query
results = vectorstore.similarity_search("What is the main topic?")

| Parameter | Type | Default | Description |
|---|---|---|---|
| file_path | str \| List[str] | - | (Required) PDF file path(s) or directories |
| format | str | "text" | Output format: "text", "markdown", "json", "html" |
| split_pages | bool | True | Split into separate Documents per page |
| quiet | bool | False | Suppress console logging |
| password | str | None | Password for encrypted PDFs |
| use_struct_tree | bool | False | Use PDF structure tree (tagged PDFs) |
| table_method | str | "default" | "default" (border-based) or "cluster" (border + clustering) |
| reading_order | str | "xycut" | "xycut" or "off" |
| keep_line_breaks | bool | False | Preserve original line breaks |
| image_output | str | "off" | "off", "embedded" (Base64), or "external" |
| image_format | str | "png" | "png" or "jpeg" |
| content_safety_off | List[str] | None | Disable safety filters: "hidden-text", "off-page", "tiny", "hidden-ocg", "all" |
| replace_invalid_chars | str | None | Replacement for invalid characters |

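The allowed values in the table above can be checked before a loader is constructed, so that a typo such as `format="pdf"` fails early with a readable message. This is only a sketch: `ALLOWED_VALUES`, `check_loader_options`, and `build_loader` are hypothetical helpers, not part of langchain-opendataloader-pdf, and the value sets are copied from the table rather than read from the library. Whether the loader performs its own validation is not stated here.

```python
# Allowed string values, copied from the parameter table. This mapping and
# the helpers below are illustrative and not part of the library.
ALLOWED_VALUES = {
    "format": {"text", "markdown", "json", "html"},
    "table_method": {"default", "cluster"},
    "reading_order": {"xycut", "off"},
    "image_output": {"off", "embedded", "external"},
    "image_format": {"png", "jpeg"},
}


def check_loader_options(**options):
    """Return the options unchanged, or raise ValueError on an unlisted value."""
    for name, value in options.items():
        allowed = ALLOWED_VALUES.get(name)
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
    return options


def build_loader(file_path, **options):
    """Validate the options, then construct the loader."""
    # Imported here so the validation helper works without the package installed.
    from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

    return OpenDataLoaderPDFLoader(
        file_path=file_path, **check_loader_options(**options)
    )


options = check_loader_options(
    format="markdown",
    table_method="cluster",
    reading_order="xycut",
    split_pages=False,  # per the table: do not split into per-page Documents
    keep_line_breaks=True,
)
print(sorted(options))
```

With the package and Java installed, `build_loader("report.pdf", **options).load()` would then run the loader with the validated settings.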
Each returned Document includes metadata:
doc.metadata
# {'source': 'document.pdf', 'format': 'text', 'page': 1}

MIT License. See LICENSE for details.

- OpenDataLoader PDF — Core PDF parsing engine
- LangChain Python Docs — Python API reference
- LangChain Integration Guide — Integration documentation
- PyPI Package