- Deterministic — Same input always produces same output (no LLM hallucinations)
- Fast — Process 100+ pages per second on CPU
- Private — 100% local, zero data transmission
- Accurate — Bounding boxes for every element, correct multi-column reading order
Overview
Integration details
| Class | Package | Local | Serializable | JS support |
|---|---|---|---|---|
| OpenDataLoader PDF | langchain-opendataloader-pdf | ✅ | ❌ | ❌ |
Loader features
| Source | Document Lazy Loading | Native Async Support |
|---|---|---|
| OpenDataLoaderPDFLoader | ✅ | ❌ |
The OpenDataLoaderPDFLoader component parses PDFs into structured Document objects.
Requirements
- Python >= 3.10
- Java 11 or newer available on the system PATH
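Both requirements can be checked before the loader is constructed. The following is a minimal sketch using only the standard library; the function name `check_requirements` is hypothetical, and it only confirms that a `java` executable is on PATH, not that it is version 11 or newer.

```python
import shutil
import sys


def check_requirements(min_python=(3, 10)):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info < min_python:
        found = sys.version.split()[0]
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, found {found}"
        )
    # shutil.which searches PATH the way a shell would.
    if shutil.which("java") is None:
        problems.append("No 'java' executable found on PATH (Java 11 or newer is required)")
    return problems
```

Calling `check_requirements()` at start-up and logging the returned messages tends to give a clearer error than a failure deep inside PDF parsing.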
Installation
Quick start
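Once the `langchain-opendataloader-pdf` package is installed from PyPI, a first call might look like the sketch below. The module name `langchain_opendataloader_pdf` is assumed from the package name, `load_pdf` is a hypothetical helper, and `example.pdf` is a placeholder path.

```python
def load_pdf(path, **options):
    """Load a PDF file (or a directory of PDFs) into LangChain Document objects."""
    # Imported inside the function so the helper can be defined even when
    # the package is not installed; the module name is an assumption.
    from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

    loader = OpenDataLoaderPDFLoader(file_path=path, **options)
    return loader.load()


# Example call (needs the package, Java, and a real PDF):
# docs = load_pdf("example.pdf", format="markdown")
# print(len(docs), docs[0].page_content[:80])
```

With the default `split_pages=True`, the returned list should contain one Document per page.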
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file_path | str \| List[str] | - | (Required) PDF file path(s) or directories |
| format | str | "text" | Output format: "text", "markdown", "json", "html" |
| split_pages | bool | True | Split into separate Documents per page |
| quiet | bool | False | Suppress console logging |
| password | str | None | Password for encrypted PDFs |
| use_struct_tree | bool | False | Use PDF structure tree (tagged PDFs) |
| table_method | str | "default" | "default" (border-based) or "cluster" (border + clustering) |
| reading_order | str | "xycut" | "xycut" or "off" |
| keep_line_breaks | bool | False | Preserve original line breaks |
| image_output | str | "off" | "off", "embedded" (Base64), or "external" |
| image_format | str | "png" | "png" or "jpeg" |
| content_safety_off | List[str] | None | Disable safety filters: "hidden-text", "off-page", "tiny", "hidden-ocg", "all" |
| replace_invalid_chars | str | None | Replacement for invalid characters |
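Several parameters accept only a fixed set of strings. The sketch below checks keyword arguments against the values listed in the table before they reach the loader. `validate_options` is a hypothetical helper and not part of the package; whether the loader performs its own validation is not stated here.

```python
# Allowed values copied from the parameter table above.
ALLOWED_VALUES = {
    "format": {"text", "markdown", "json", "html"},
    "table_method": {"default", "cluster"},
    "reading_order": {"xycut", "off"},
    "image_output": {"off", "embedded", "external"},
    "image_format": {"png", "jpeg"},
}
SAFETY_FILTERS = {"hidden-text", "off-page", "tiny", "hidden-ocg", "all"}


def validate_options(**options):
    """Raise ValueError for values the table does not list; return options unchanged."""
    for name, value in options.items():
        allowed = ALLOWED_VALUES.get(name)
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
    unknown = set(options.get("content_safety_off") or []) - SAFETY_FILTERS
    if unknown:
        raise ValueError(f"unknown content_safety_off entries: {sorted(unknown)}")
    return options
```

The validated dictionary can then be unpacked into the loader constructor, for example `OpenDataLoaderPDFLoader(file_path=..., **validate_options(format="json"))`.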
Usage examples
Output formats
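The `format` parameter selects one of four output formats. To compare them on the same file, one option is to load it once per format, as in this sketch. `load_all_formats` is a hypothetical helper, the `loader_cls` argument exists only so the function can be exercised with a stand-in class, and the module name is assumed from the package name.

```python
FORMATS = ("text", "markdown", "json", "html")


def load_all_formats(path, loader_cls=None):
    """Return {format_name: list_of_documents} for every supported format."""
    if loader_cls is None:
        # Assumed module name; requires the package and Java at call time.
        from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader as loader_cls
    return {fmt: loader_cls(file_path=path, format=fmt).load() for fmt in FORMATS}
```

Each format triggers a separate parse, so in practice it is usually better to pick the single format the downstream step needs.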
Tagged PDF support
For accessible PDFs with structure tags (common in government/legal documents), set use_struct_tree to True.
Password-protected PDFs
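Encrypted files are opened by passing `password`, and tagged PDFs by passing `use_struct_tree=True`. The sketch below builds both options and reads the password from an environment variable so it does not end up in source code. `build_options` and the variable name `PDF_PASSWORD` are hypothetical.

```python
import os


def build_options(tagged=False, password_env=None):
    """Build loader keyword arguments for tagged and/or encrypted PDFs."""
    options = {}
    if tagged:
        # Use the PDF structure tree instead of layout heuristics.
        options["use_struct_tree"] = True
    if password_env is not None:
        password = os.environ.get(password_env)
        if password is None:
            raise ValueError(f"environment variable {password_env} is not set")
        options["password"] = password
    return options


# Example call (module name assumed from the package name):
# from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
# loader = OpenDataLoaderPDFLoader(
#     file_path="report.pdf",
#     **build_options(tagged=True, password_env="PDF_PASSWORD"),
# )
```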
Image handling
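With `image_output="embedded"`, images are included as Base64. The sketch below recovers the raw bytes from a page's text, assuming the images appear as `data:image/...;base64,` URIs inside the content. That representation is an assumption, not something this page confirms, and `extract_embedded_images` is a hypothetical helper.

```python
import base64
import re

# Matches data URIs for the two image formats the loader can emit.
DATA_URI = re.compile(r"data:image/(png|jpeg);base64,([A-Za-z0-9+/=]+)")


def extract_embedded_images(text):
    """Return (image_format, raw_bytes) for every Base64 data URI found in text."""
    return [
        (match.group(1), base64.b64decode(match.group(2)))
        for match in DATA_URI.finditer(text)
    ]
```

If the images are needed as files anyway, `image_output="external"` is likely the simpler choice, since it avoids the decoding step altogether.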
Document metadata
Each returned Document includes metadata.
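The exact metadata keys are not listed here, so the sketch below inspects whatever the loader returns without assuming any key names. `metadata_key_counts` is a hypothetical helper; it relies only on each Document exposing a `metadata` dictionary, as LangChain Document objects do.

```python
from collections import Counter


def metadata_key_counts(documents):
    """Count how many documents carry each metadata key."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.metadata.keys())
    return dict(counts)
```

Running this over the result of `loader.load()` shows which keys are present on every page and which appear only on some.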
Additional resources
- LangChain OpenDataLoader PDF integration GitHub
- LangChain OpenDataLoader PDF integration PyPI package
- OpenDataLoader PDF GitHub
- OpenDataLoader PDF Homepage