OpenDataLoader PDF integration

PDF Parsing for RAG: Convert to Markdown & JSON, Fast, Local, No GPU

OpenDataLoader PDF converts PDFs into LLM-ready Markdown and JSON with accurate reading order, table extraction, and bounding boxes, all running locally on your machine.

Why developers choose OpenDataLoader:

Deterministic — Same input always produces same output (no LLM hallucinations)
Fast — Process 100+ pages per second on CPU
Private — 100% local, zero data transmission
Accurate — Bounding boxes for every element, correct multi-column reading order

Overview

Integration details

Class	Package	Local	Serializable	JS support
OpenDataLoader PDF	langchain-opendataloader-pdf	✅	❌	❌

Loader features

Source	Document Lazy Loading	Native Async Support
OpenDataLoaderPDFLoader	✅	❌

The OpenDataLoaderPDFLoader component enables you to parse PDFs into structured Document objects.
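
Because the loader is listed as supporting lazy loading, a large batch can be sampled without building the full list first. The sketch below assumes the loader exposes the standard LangChain `lazy_load()` generator; the `preview` helper is a hypothetical name, and how much parsing work is actually deferred depends on the loader's implementation.

```python
from itertools import islice
from typing import Any


def preview(loader: Any, limit: int = 3) -> list:
    """Return at most `limit` Documents without consuming the rest."""
    # lazy_load() is the standard LangChain loader method that yields
    # Documents one at a time instead of returning a complete list.
    return list(islice(loader.lazy_load(), limit))


# Usage (requires the package and Java to be installed):
# from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
# first_pages = preview(OpenDataLoaderPDFLoader(file_path="doc.pdf"), limit=3)
```

`islice` stops pulling from the generator once `limit` items have been produced, so later Documents are never requested.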

Requirements

Python >= 3.10
Java 11 or newer available on the system PATH
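
Both requirements can be checked up front before installing. The following is a minimal sketch using only the standard library; `parse_java_major` and `check_requirements` are hypothetical helper names, and the parsing assumes the usual `java -version` banner format (`version "17.0.2"`, or `version "1.8.0_281"` for Java 8 and earlier).

```python
import re
import shutil
import subprocess
import sys
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x (for example "1.8.0_281").
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def check_requirements() -> list:
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10 or newer is required")
    java = shutil.which("java")
    if java is None:
        problems.append("java was not found on PATH")
        return problems
    # The version banner is usually written to stderr, so read both streams.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    major = parse_java_major(result.stderr + result.stdout)
    if major is None:
        problems.append("could not determine the Java version")
    elif major < 11:
        problems.append(f"Java 11 or newer is required, found Java {major}")
    return problems
```

Calling `check_requirements()` and printing each returned string gives a quick diagnostic when the loader fails to start.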

Installation

pip install -U langchain-opendataloader-pdf

Quick start

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

loader = OpenDataLoaderPDFLoader(
    file_path=["path/to/document.pdf", "path/to/folder"],
    format="text"
)
documents = loader.load()

for doc in documents:
    print(doc.metadata, doc.page_content[:80])

Parameters

Parameter	Type	Default	Description
`file_path`	`str | List[str]`	(required)	PDF file path(s) or directories
`format`	`str`	`"text"`	Output format: `"text"`, `"markdown"`, `"json"`, `"html"`
`split_pages`	`bool`	`True`	Split into separate Documents per page
`quiet`	`bool`	`False`	Suppress console logging
`password`	`str`	`None`	Password for encrypted PDFs
`use_struct_tree`	`bool`	`False`	Use PDF structure tree (tagged PDFs)
`table_method`	`str`	`"default"`	`"default"` (border-based) or `"cluster"` (border + clustering)
`reading_order`	`str`	`"xycut"`	`"xycut"` or `"off"`
`keep_line_breaks`	`bool`	`False`	Preserve original line breaks
`image_output`	`str`	`"off"`	`"off"`, `"embedded"` (Base64), or `"external"`
`image_format`	`str`	`"png"`	`"png"` or `"jpeg"`
`content_safety_off`	`List[str]`	`None`	Disable safety filters: `"hidden-text"`, `"off-page"`, `"tiny"`, `"hidden-ocg"`, `"all"`
`replace_invalid_chars`	`str`	`None`	Replacement for invalid characters
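
Several parameters accept only a fixed set of strings, so a typo can be caught before the loader is constructed. This is a sketch under the assumption that the table above is complete; `validate_options` and `build_loader` are hypothetical helpers, not part of the package, and the loader may perform its own validation.

```python
# Allowed values copied from the parameter table above.
ALLOWED_VALUES = {
    "format": ("text", "markdown", "json", "html"),
    "table_method": ("default", "cluster"),
    "reading_order": ("xycut", "off"),
    "image_output": ("off", "embedded", "external"),
    "image_format": ("png", "jpeg"),
}
SAFETY_FILTERS = ("hidden-text", "off-page", "tiny", "hidden-ocg", "all")


def validate_options(options: dict) -> list:
    """Return a list of problems found in loader keyword arguments."""
    problems = []
    for name, allowed in ALLOWED_VALUES.items():
        if name in options and options[name] not in allowed:
            problems.append(
                f"{name}={options[name]!r} is not one of {list(allowed)}"
            )
    for item in options.get("content_safety_off") or []:
        if item not in SAFETY_FILTERS:
            problems.append(f"content_safety_off contains unknown filter {item!r}")
    return problems


def build_loader(file_path, **options):
    """Validate options, then construct the loader."""
    problems = validate_options(options)
    if problems:
        raise ValueError("; ".join(problems))
    # Imported lazily so validate_options works without the package installed.
    from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
    return OpenDataLoaderPDFLoader(file_path=file_path, **options)
```

For example, `build_loader("doc.pdf", format="markdown", table_method="cluster")` passes validation, while `format="pdf"` raises a `ValueError` naming the accepted values.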

Usage examples

Output formats

# Plain text (default) - best for simple RAG
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="text")

# Markdown - preserves headings, lists, tables
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="markdown")

# JSON - structured data with bounding boxes
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="json")

# HTML - styled output
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="html")

Tagged PDF support

For accessible PDFs with structure tags (common in government/legal documents):

loader = OpenDataLoaderPDFLoader(
    file_path="accessible_document.pdf",
    use_struct_tree=True  # Use native PDF structure
)

Password-protected PDFs

loader = OpenDataLoaderPDFLoader(
    file_path="encrypted.pdf",
    password="secret123"
)
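
Hardcoding a password in source is rarely desirable. One option, sketched here, is to read it from an environment variable; the variable name `PDF_PASSWORD` and the helper `password_from_env` are assumptions for illustration only.

```python
import os


def password_from_env(var_name: str = "PDF_PASSWORD", env=None):
    """Return the password stored in an environment variable, or None."""
    env = os.environ if env is None else env
    value = env.get(var_name)
    # Treat an empty string as "not set" so the loader receives None.
    return value or None


# Usage (requires the package and Java to be installed):
# from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
# loader = OpenDataLoaderPDFLoader(
#     file_path="encrypted.pdf",
#     password=password_from_env(),
# )
```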

Image handling

# Images are excluded by default (image_output="off")
# This is optimal for text-based RAG pipelines

# Embed images as Base64 (for multimodal RAG)
loader = OpenDataLoaderPDFLoader(
    file_path="doc.pdf",
    format="markdown",
    image_output="embedded",
    image_format="jpeg"  # or "png"
)
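
When images are embedded, a multimodal pipeline usually needs the raw bytes, while a text embedder is better served without long Base64 strings. The sketch below assumes embedded images appear in the Markdown output as image links with a `data:` URI target; that layout is an assumption, not something confirmed here, so inspect the actual output before relying on it. Both helper names are hypothetical.

```python
import base64
import re

# Assumption: embedded images are written as Markdown image links whose target
# is a data URI, for example ![alt](data:image/png;base64,AAAA...).
DATA_URI_IMAGE = re.compile(
    r"!\[[^\]]*\]\(data:image/(?P<kind>png|jpeg);base64,(?P<data>[A-Za-z0-9+/=]+)\)"
)


def extract_embedded_images(markdown: str) -> list:
    """Return (image_format, raw_bytes) for each embedded image found."""
    return [
        (m.group("kind"), base64.b64decode(m.group("data")))
        for m in DATA_URI_IMAGE.finditer(markdown)
    ]


def strip_embedded_images(markdown: str, placeholder: str = "[image]") -> str:
    """Replace embedded images with a short placeholder for text embedding."""
    return DATA_URI_IMAGE.sub(placeholder, markdown)
```

Applied to `doc.page_content`, the first function yields bytes that can be passed to an image model, and the second yields text suitable for chunking.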

Document metadata

Each returned Document includes metadata:

doc.metadata
# {'source': 'document.pdf', 'format': 'text', 'page': 1}
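
With page splitting enabled, the `source` and `page` keys are enough to reassemble a file's text in order. A minimal sketch, assuming every Document carries the `source` key shown above; `merge_pages_by_source` is a hypothetical helper.

```python
from collections import defaultdict


def merge_pages_by_source(documents, separator="\n\n"):
    """Join per-page Documents back into one text per source file."""
    pages = defaultdict(list)
    for doc in documents:
        # Documents without a "page" key sort first via the default of 0.
        pages[doc.metadata["source"]].append(
            (doc.metadata.get("page", 0), doc.page_content)
        )
    return {
        source: separator.join(
            text for _, text in sorted(items, key=lambda item: item[0])
        )
        for source, items in pages.items()
    }
```

The result maps each source path to its full text, which is convenient when a downstream splitter should work across page boundaries.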
