PDF Parsing for RAG: Convert to Markdown & JSON, Fast, Local, No GPU

OpenDataLoader PDF converts PDFs into LLM-ready Markdown and JSON with accurate reading order, table extraction, and bounding boxes, all running locally on your machine.

Why developers choose OpenDataLoader:
  • Deterministic — Same input always produces same output (no LLM hallucinations)
  • Fast — Process 100+ pages per second on CPU
  • Private — 100% local, zero data transmission
  • Accurate — Bounding boxes for every element, correct multi-column reading order

Overview

Integration details

Class               Package                       Local  Serializable  JS support
OpenDataLoader PDF  langchain-opendataloader-pdf

Loader features

Source                   Document Lazy Loading  Native Async Support
OpenDataLoaderPDFLoader
The OpenDataLoaderPDFLoader component enables you to parse PDFs into structured Document objects.

Requirements

  • Python >= 3.10
  • Java 11 or newer available on the system PATH
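
Both requirements can be checked before installing. The following is a minimal sketch using only the standard library; `check_requirements` is a hypothetical helper, not part of the package. It only confirms that a `java` executable is on the PATH and does not verify the Java version.

```python
import shutil
import sys


def check_requirements(min_python=(3, 10), java_cmd="java"):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if sys.version_info < min_python:
        found = ".".join(str(n) for n in sys.version_info[:3])
        wanted = ".".join(str(n) for n in min_python)
        problems.append(f"Python {wanted}+ required, found {found}")
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(java_cmd) is None:
        problems.append(f"'{java_cmd}' not found on PATH (Java 11 or newer is required)")
    return problems


for problem in check_requirements():
    print("Missing requirement:", problem)
```

If nothing is printed, both checks passed. Run `java -version` separately to confirm the installed Java is 11 or newer.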

Installation

pip install -U langchain-opendataloader-pdf

Quick start

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

loader = OpenDataLoaderPDFLoader(
    file_path=["path/to/document.pdf", "path/to/folder"],
    format="text"
)
documents = loader.load()

for doc in documents:
    print(doc.metadata, doc.page_content[:80])

Parameters

Parameter              Type             Default     Description
file_path              str | List[str]  (required)  PDF file path(s) or directories
format                 str              "text"      Output format: "text", "markdown", "json", or "html"
split_pages            bool             True        Split output into one Document per page
quiet                  bool             False       Suppress console logging
password               str              None        Password for encrypted PDFs
use_struct_tree        bool             False       Use the PDF structure tree (tagged PDFs)
table_method           str              "default"   "default" (border-based) or "cluster" (border + clustering)
reading_order          str              "xycut"     "xycut" or "off"
keep_line_breaks       bool             False       Preserve original line breaks
image_output           str              "off"       "off", "embedded" (Base64), or "external"
image_format           str              "png"       "png" or "jpeg"
content_safety_off     List[str]        None        Safety filters to disable: "hidden-text", "off-page", "tiny", "hidden-ocg", or "all"
replace_invalid_chars  str              None        Replacement for invalid characters
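
Several string parameters accept only the fixed values listed in the table above. The sketch below shows one way to catch a typo before constructing the loader. `ALLOWED_VALUES` and `validate_options` are hypothetical names introduced here for illustration; the loader itself may handle invalid values differently.

```python
# Allowed values copied from the parameter table; this mapping is not part of the package.
ALLOWED_VALUES = {
    "format": {"text", "markdown", "json", "html"},
    "table_method": {"default", "cluster"},
    "reading_order": {"xycut", "off"},
    "image_output": {"off", "embedded", "external"},
    "image_format": {"png", "jpeg"},
}


def validate_options(**options):
    """Raise ValueError if an enumerated option has a value the table does not list."""
    for name, value in options.items():
        allowed = ALLOWED_VALUES.get(name)
        # Options without a fixed value set (e.g. split_pages) pass through unchecked.
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
    return options


options = validate_options(format="markdown", table_method="cluster", split_pages=True)
# The validated dict could then be passed on, for example:
# loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", **options)
```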

Usage examples

Output formats

# Plain text (default) - best for simple RAG
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="text")

# Markdown - preserves headings, lists, tables
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="markdown")

# JSON - structured data with bounding boxes
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="json")

# HTML - styled output
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="html")

Tagged PDF support

For accessible PDFs with structure tags (common in government/legal documents):
loader = OpenDataLoaderPDFLoader(
    file_path="accessible_document.pdf",
    use_struct_tree=True  # Use native PDF structure
)

Password-protected PDFs

loader = OpenDataLoaderPDFLoader(
    file_path="encrypted.pdf",
    password="secret123"
)
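
To keep the password out of source code, it can be read from the environment instead. This is a minimal sketch: the `PDF_PASSWORD` variable name and the `loader_kwargs_from_env` helper are assumptions made for illustration and are not defined by the package.

```python
import os


def loader_kwargs_from_env(file_path, env_var="PDF_PASSWORD"):
    """Build loader keyword arguments, adding `password` only if the variable is set."""
    kwargs = {"file_path": file_path}
    password = os.environ.get(env_var)
    if password:
        kwargs["password"] = password
    return kwargs


kwargs = loader_kwargs_from_env("encrypted.pdf")
# loader = OpenDataLoaderPDFLoader(**kwargs)
```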

Image handling

# Images are excluded by default (image_output="off")
# This is optimal for text-based RAG pipelines

# Embed images as Base64 (for multimodal RAG)
loader = OpenDataLoaderPDFLoader(
    file_path="doc.pdf",
    format="markdown",
    image_output="embedded",
    image_format="jpeg"  # or "png"
)
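
With image_output="embedded", a downstream step may need the raw image bytes. The sketch below assumes the embedded images appear in the Markdown output as standard image links with Base64 data URIs, which this page does not confirm; inspect real output before relying on the pattern. `extract_embedded_images` is a hypothetical helper.

```python
import base64
import re

# Assumed form: ![alt](data:image/png;base64,....) or the jpeg equivalent.
DATA_URI_IMAGE = re.compile(
    r"!\[[^\]]*\]\(data:image/(png|jpeg);base64,([A-Za-z0-9+/=]+)\)"
)


def extract_embedded_images(markdown_text):
    """Return (image_format, raw_bytes) for each data-URI image found in the text."""
    return [
        (match.group(1), base64.b64decode(match.group(2)))
        for match in DATA_URI_IMAGE.finditer(markdown_text)
    ]


# Tiny made-up sample; the payload is the Base64 encoding of b"hello", not a real image.
sample = "Intro text\n\n![figure](data:image/png;base64,aGVsbG8=)\n"
images = extract_embedded_images(sample)
```

In a pipeline, `markdown_text` would be `doc.page_content` from a Document loaded with format="markdown".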

Document metadata

Each returned Document includes metadata:
doc.metadata
# {'source': 'document.pdf', 'format': 'text', 'page': 1}
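
When split_pages is left at True, the 'source' and 'page' keys can be used to reassemble the full text of each file. The sketch below uses a small stand-in class (`PageDoc`, hypothetical) in place of the LangChain Document so that it runs without the loader; it assumes every document carries both keys, as in the example above.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Stand-in for a LangChain Document, used so this sketch runs on its own."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def join_pages_by_source(documents, separator="\n\n"):
    """Group per-page documents by metadata['source'] and join them in page order."""
    pages = defaultdict(list)
    for doc in documents:
        pages[doc.metadata["source"]].append(doc)
    return {
        source: separator.join(
            d.page_content for d in sorted(docs, key=lambda d: d.metadata["page"])
        )
        for source, docs in pages.items()
    }


docs = [
    PageDoc("second page", {"source": "a.pdf", "format": "text", "page": 2}),
    PageDoc("first page", {"source": "a.pdf", "format": "text", "page": 1}),
    PageDoc("only page", {"source": "b.pdf", "format": "text", "page": 1}),
]
full_text = join_pages_by_source(docs)
print(full_text["a.pdf"])
```

The same function should work on the list returned by `loader.load()`, since it only reads `page_content` and `metadata`.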
