langchain-opendataloader-pdf

LangChain document loader for OpenDataLoader PDF — parse PDFs into structured Document objects for RAG pipelines.

Features

Accurate reading order — XY-Cut++ algorithm handles multi-column layouts correctly
Table extraction — Preserves table structure in output
Multiple formats — Text, Markdown, JSON, HTML
100% local — No cloud APIs, your documents never leave your machine
Fast — Rule-based extraction, no GPU required
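
For intuition about the reading-order feature, here is a minimal sketch of the classic recursive XY-cut idea: split the page at the widest empty gap, then recurse into each part. It is not the XY-Cut++ algorithm and not this library's implementation. The `Box` tuple layout and the `xy_cut` and `_widest_gap` helpers are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# (x0, y0, x1, y1) with y growing downward; a hypothetical layout for this sketch.
Box = Tuple[float, float, float, float]


def _widest_gap(boxes: List[Box], lo: int, hi: int) -> Tuple[float, Optional[float]]:
    """Return (width, midpoint) of the widest empty gap along one axis.

    lo/hi are the tuple indices of the box start/end on that axis.
    The midpoint is None when the projected boxes leave no gap.
    """
    intervals = sorted((b[lo], b[hi]) for b in boxes)
    best_width, best_cut = 0.0, None
    reach = intervals[0][1]  # furthest coordinate covered so far
    for start, end in intervals[1:]:
        if start > reach and start - reach > best_width:
            best_width, best_cut = start - reach, (start + reach) / 2
        reach = max(reach, end)
    return best_width, best_cut


def xy_cut(boxes: List[Box]) -> List[Box]:
    """Order boxes by recursively cutting at the widest horizontal or vertical gap."""
    if len(boxes) <= 1:
        return list(boxes)
    y_width, y_cut = _widest_gap(boxes, 1, 3)  # gap between rows
    x_width, x_cut = _widest_gap(boxes, 0, 2)  # gap between columns
    if y_cut is None and x_cut is None:
        # No clean cut exists: fall back to top-to-bottom, left-to-right.
        return sorted(boxes, key=lambda b: (b[1], b[0]))
    if y_cut is not None and y_width >= x_width:
        first = [b for b in boxes if b[3] <= y_cut]   # above the cut
        second = [b for b in boxes if b[1] >= y_cut]  # below the cut
    else:
        first = [b for b in boxes if b[2] <= x_cut]   # left of the cut
        second = [b for b in boxes if b[0] >= x_cut]  # right of the cut
    return xy_cut(first) + xy_cut(second)
```

On a page with a full-width title above two columns, this sketch reads the title first, then the whole left column, then the right column, rather than zig-zagging across columns line by line.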

Requirements

Python >= 3.10
Java 11+ available on system PATH
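
A quick way to confirm these prerequisites before installing is a small standard-library check. This is a sketch, not part of the package: `check_environment` is a hypothetical helper, and it only confirms that a `java` executable is on PATH, not that its version is 11 or newer.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # taken from the Requirements list


def check_environment() -> list:
    """Return human-readable problems; an empty list means the basics look fine."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, found {found}"
        )
    # shutil.which searches PATH the same way a shell would.
    if shutil.which("java") is None:
        problems.append("No 'java' executable found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```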

Installation

pip install -U langchain-opendataloader-pdf

Quick Start

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

# Load a PDF as text
loader = OpenDataLoaderPDFLoader(
    file_path="document.pdf",
    format="text"
)
documents = loader.load()

print(documents[0].page_content)

Usage Examples

Basic Usage

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

# Single file
loader = OpenDataLoaderPDFLoader(file_path="report.pdf")
docs = loader.load()

# Multiple files and directories
loader = OpenDataLoaderPDFLoader(
    file_path=["report1.pdf", "report2.pdf", "documents/"]
)
docs = loader.load()
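
The loader accepts directories directly, so the following is optional. If you want to see or log exactly which PDFs will be picked up, or fail early on a bad path, a pre-pass like this sketch can help. `collect_pdf_paths` is a hypothetical helper, and its recursive search is an assumption that may not match how the loader itself walks directories.

```python
from pathlib import Path
from typing import Iterable, List


def collect_pdf_paths(paths: Iterable[str]) -> List[str]:
    """Expand files and directories into a sorted, de-duplicated list of PDF paths."""
    found = set()
    for raw in paths:
        p = Path(raw)
        if p.is_dir():
            # Search the directory tree; match the extension case-insensitively.
            found.update(
                str(f) for f in p.rglob("*")
                if f.is_file() and f.suffix.lower() == ".pdf"
            )
        elif p.is_file() and p.suffix.lower() == ".pdf":
            found.add(str(p))
        # Missing paths and non-PDF files are skipped silently in this sketch.
    return sorted(found)
```

The resulting list can be passed as `file_path` in place of the mixed list shown above.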

Output Formats

# Plain text (default) - best for simple RAG
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="text")

# Markdown - preserves headings, lists, tables
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="markdown")

# JSON - structured data with bounding boxes
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="json")

# HTML - styled output
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="html")

Tagged PDF Support

For accessible PDFs with structure tags (common in government/legal documents):

loader = OpenDataLoaderPDFLoader(
    file_path="accessible_document.pdf",
    use_struct_tree=True  # Use native PDF structure
)

Table Detection

For documents with complex tables:

loader = OpenDataLoaderPDFLoader(
    file_path="financial_report.pdf",
    format="markdown",
    table_method="cluster"  # Better for borderless tables
)

Password-Protected PDFs

loader = OpenDataLoaderPDFLoader(
    file_path="encrypted.pdf",
    password="secret123"
)

Image Handling

# Images are excluded by default (image_output="off")
# This is optimal for text-based RAG pipelines

# Embed images as Base64 (for multimodal RAG)
loader = OpenDataLoaderPDFLoader(
    file_path="doc.pdf",
    format="markdown",
    image_output="embedded",
    image_format="jpeg"  # or "png"
)

Suppress Logging

loader = OpenDataLoaderPDFLoader(
    file_path="doc.pdf",
    quiet=True  # No console output
)

RAG Pipeline Example

# Requires: pip install langchain-text-splitters langchain-openai langchain-community faiss-cpu
from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Load PDF
loader = OpenDataLoaderPDFLoader(
    file_path="knowledge_base.pdf",
    format="markdown",
    quiet=True
)
documents = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_documents(documents)

# Create vector store (OpenAIEmbeddings reads OPENAI_API_KEY from the environment)
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)

# Query
results = vectorstore.similarity_search("What is the main topic?")
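
To see what `chunk_size=1000` and `chunk_overlap=200` mean in the pipeline above, here is a simplified fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and newlines); `sliding_chunks` is a hypothetical helper that only illustrates the size and overlap arithmetic.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list:
    """Cut text into fixed-size windows where each window repeats the tail of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the default values each window advances 800 characters, so the last 200 characters of one chunk reappear at the start of the next. That repetition is what keeps a sentence that straddles a boundary retrievable from at least one chunk.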

Parameters Reference

Parameter	Type	Default	Description
`file_path`	`str | List[str]`	(required)	PDF file path(s) or directories
`format`	`str`	`"text"`	Output format: `"text"`, `"markdown"`, `"json"`, `"html"`
`split_pages`	`bool`	`True`	Split into separate Documents per page
`quiet`	`bool`	`False`	Suppress console logging
`password`	`str`	`None`	Password for encrypted PDFs
`use_struct_tree`	`bool`	`False`	Use PDF structure tree (tagged PDFs)
`table_method`	`str`	`"default"`	`"default"` (border-based) or `"cluster"` (border + clustering)
`reading_order`	`str`	`"xycut"`	`"xycut"` or `"off"`
`keep_line_breaks`	`bool`	`False`	Preserve original line breaks
`image_output`	`str`	`"off"`	`"off"`, `"embedded"` (Base64), or `"external"`
`image_format`	`str`	`"png"`	`"png"` or `"jpeg"`
`content_safety_off`	`List[str]`	`None`	Disable safety filters: `"hidden-text"`, `"off-page"`, `"tiny"`, `"hidden-ocg"`, `"all"`
`replace_invalid_chars`	`str`	`None`	Replacement for invalid characters
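
Several of these parameters take one of a fixed set of strings. The sketch below checks them against the values listed in the table before a loader is constructed. `validate_loader_options` and `ALLOWED` are hypothetical names, not part of the package, and the loader may perform its own validation.

```python
# Allowed string values, copied from the parameters table above.
ALLOWED = {
    "format": {"text", "markdown", "json", "html"},
    "table_method": {"default", "cluster"},
    "reading_order": {"xycut", "off"},
    "image_output": {"off", "embedded", "external"},
    "image_format": {"png", "jpeg"},
}


def validate_loader_options(**options) -> dict:
    """Raise ValueError on an unknown choice; return the options unchanged otherwise."""
    for name, value in options.items():
        allowed = ALLOWED.get(name)
        # Parameters without a fixed choice list (quiet, password, ...) pass through.
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
    return options
```

Because the function returns its input, it can wrap the keyword arguments in place, for example `OpenDataLoaderPDFLoader(file_path="doc.pdf", **validate_loader_options(format="markdown"))`.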

Document Metadata

Each returned Document includes metadata:

doc.metadata
# {'source': 'document.pdf', 'format': 'text', 'page': 1}
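
When several files are loaded with one Document per page, the `source` and `page` keys shown above are enough to reassemble pages per file. The sketch below assumes those two keys are present on every Document. `Doc` is a minimal stand-in for a LangChain Document so the example runs on its own, and `group_pages_by_source` is a hypothetical helper.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document; only the two attributes used here."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def group_pages_by_source(docs) -> dict:
    """Map each source path to its Documents, sorted by page number."""
    grouped = defaultdict(list)
    for doc in docs:
        grouped[doc.metadata.get("source", "unknown")].append(doc)
    for pages in grouped.values():
        pages.sort(key=lambda d: d.metadata.get("page", 0))
    return dict(grouped)
```

The same function works on the list returned by `loader.load()`, since it only reads `metadata` from each item.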

License

MIT License. See LICENSE for details.

Links

OpenDataLoader PDF — Core PDF parsing engine
LangChain Python Docs — Python API reference
LangChain Integration Guide — Integration documentation
PyPI Package
