# OpenDataLoader PDF integration

> Integrate with the OpenDataLoader PDF document loader using LangChain Python.

**PDF parsing for RAG:** fast conversion to Markdown and JSON that runs locally and needs no GPU.

[OpenDataLoader PDF](https://github.com/opendataloader-project/opendataloader-pdf) converts PDFs into **LLM-ready Markdown and JSON** with accurate reading order, table extraction, and bounding boxes, all running locally on your machine.

**Why developers choose OpenDataLoader:**

* **Deterministic:** the same input always produces the same output (no LLM hallucinations)
* **Fast:** processes 100+ pages per second on CPU
* **Private:** 100% local, zero data transmission
* **Accurate:** bounding boxes for every element and correct multi-column reading order

## Overview

### Integration details

| Class                                                                              | Package                                                                                  | Local | Serializable | JS support |
| :--------------------------------------------------------------------------------- | :--------------------------------------------------------------------------------------- | :---: | :----------: | :--------: |
| [OpenDataLoader PDF](https://github.com/opendataloader-project/opendataloader-pdf) | [`langchain-opendataloader-pdf`](https://pypi.org/project/langchain-opendataloader-pdf/) |   ✅   |       ❌      |      ❌     |

### Loader features

|           Source          | Document Lazy Loading | Native Async Support |
| :-----------------------: | :-------------------: | :------------------: |
| `OpenDataLoaderPDFLoader` |           ✅           |           ❌          |

The `OpenDataLoaderPDFLoader` component enables you to parse PDFs into structured [`Document`](https://reference.langchain.com/python/langchain-core/documents/base/Document) objects.

## Requirements

* Python >= 3.10
* Java 11 or newer available on the system `PATH`
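
Both requirements can be checked up front before the loader is constructed. The sketch below uses only the standard library. `parse_java_major` and `check_requirements` are hypothetical helper names, and the version parsing assumes the usual `java -version` output shape (for example `version "17.0.9"` or the legacy `version "1.8.0_392"`).

```python
import re
import shutil
import subprocess
import sys
from typing import List, Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major version from `java -version` output.

    Handles the legacy scheme ("1.8.0_392" -> 8) and the modern one
    ("17.0.9" -> 17). Returns None if no version string is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def check_requirements() -> List[str]:
    """Return a list of human-readable problems; empty means all good."""
    problems: List[str] = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10 or newer is required")
    java = shutil.which("java")
    if java is None:
        problems.append("java was not found on PATH")
        return problems
    # `java -version` typically writes to stderr, so inspect both streams.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    major = parse_java_major(result.stderr + result.stdout)
    if major is None:
        problems.append("could not determine the Java version")
    elif major < 11:
        problems.append(f"Java 11 or newer is required, found {major}")
    return problems


print(check_requirements() or "requirements satisfied")
```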

## Installation

```bash theme={"theme":{"light":"catppuccin-latte","dark":"catppuccin-mocha"}}
pip install -U langchain-opendataloader-pdf
```

## Quick start

```python theme={"theme":{"light":"catppuccin-latte","dark":"catppuccin-mocha"}}
from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

loader = OpenDataLoaderPDFLoader(
    file_path=["path/to/document.pdf", "path/to/folder"],
    format="text"
)
documents = loader.load()

for doc in documents:
    print(doc.metadata, doc.page_content[:80])
```

## Parameters

| Parameter               | Type               | Default     | Description                                                                              |
| ----------------------- | ------------------ | ----------- | ---------------------------------------------------------------------------------------- |
| `file_path`             | `str \| List[str]` | —           | **(Required)** PDF file path(s) or directories                                           |
| `format`                | `str`              | `"text"`    | Output format: `"text"`, `"markdown"`, `"json"`, `"html"`                                |
| `split_pages`           | `bool`             | `True`      | Split into separate Documents per page                                                   |
| `quiet`                 | `bool`             | `False`     | Suppress console logging                                                                 |
| `password`              | `str`              | `None`      | Password for encrypted PDFs                                                              |
| `use_struct_tree`       | `bool`             | `False`     | Use PDF structure tree (tagged PDFs)                                                     |
| `table_method`          | `str`              | `"default"` | `"default"` (border-based) or `"cluster"` (border + clustering)                          |
| `reading_order`         | `str`              | `"xycut"`   | `"xycut"` or `"off"`                                                                     |
| `keep_line_breaks`      | `bool`             | `False`     | Preserve original line breaks                                                            |
| `image_output`          | `str`              | `"off"`     | `"off"`, `"embedded"` (Base64), or `"external"`                                          |
| `image_format`          | `str`              | `"png"`     | `"png"` or `"jpeg"`                                                                      |
| `content_safety_off`    | `List[str]`        | `None`      | Disable safety filters: `"hidden-text"`, `"off-page"`, `"tiny"`, `"hidden-ocg"`, `"all"` |
| `replace_invalid_chars` | `str`              | `None`      | Replacement for invalid characters                                                       |
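
Several parameters accept only a fixed set of string values, so it can help to check options before passing them to the loader. The sketch below copies the allowed values from the table above. `validate_loader_options` is a hypothetical helper and not part of the package, and whether the loader itself performs similar validation is not something this sketch assumes.

```python
from typing import Any, Dict, List, Set

# Allowed values copied from the parameter table.
ALLOWED_VALUES: Dict[str, Set[str]] = {
    "format": {"text", "markdown", "json", "html"},
    "table_method": {"default", "cluster"},
    "reading_order": {"xycut", "off"},
    "image_output": {"off", "embedded", "external"},
    "image_format": {"png", "jpeg"},
}
ALLOWED_SAFETY_FILTERS = {"hidden-text", "off-page", "tiny", "hidden-ocg", "all"}


def validate_loader_options(options: Dict[str, Any]) -> List[str]:
    """Return one error message per invalid option; empty means valid."""
    errors: List[str] = []
    for name, allowed in ALLOWED_VALUES.items():
        if name in options and options[name] not in allowed:
            errors.append(
                f"{name}={options[name]!r} is not one of {sorted(allowed)}"
            )
    # content_safety_off is a list, so each entry is checked individually.
    for item in options.get("content_safety_off") or []:
        if item not in ALLOWED_SAFETY_FILTERS:
            errors.append(f"content_safety_off contains unknown filter {item!r}")
    return errors


options = {"file_path": "doc.pdf", "format": "pdf", "image_format": "jpeg"}
for error in validate_loader_options(options):
    print(error)
```

Once the returned list is empty, the options can be passed on with `OpenDataLoaderPDFLoader(**options)`.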

## Usage examples

### Output formats

```python theme={"theme":{"light":"catppuccin-latte","dark":"catppuccin-mocha"}}
from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

# Plain text (default) - best for simple RAG
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="text")

# Markdown - preserves headings, lists, tables
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="markdown")

# JSON - structured data with bounding boxes
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="json")

# HTML - styled output
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="html")
```

### Tagged PDF support

For accessible PDFs with structure tags (common in government/legal documents):

```python theme={"theme":{"light":"catppuccin-latte","dark":"catppuccin-mocha"}}
from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

loader = OpenDataLoaderPDFLoader(
    file_path="accessible_document.pdf",
    use_struct_tree=True  # Use native PDF structure
)
```

### Password-protected PDFs

```python theme={"theme":{"light":"catppuccin-latte","dark":"catppuccin-mocha"}}
from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

loader = OpenDataLoaderPDFLoader(
    file_path="encrypted.pdf",
    password="secret123"
)
```
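
Hard-coding a password in source code is rarely desirable. One option is to read it from the environment and pass it only when it is set. The sketch below is one way to do that. The helper name `loader_kwargs_from_env` and the variable name `PDF_PASSWORD` are assumptions chosen for illustration, not conventions of the package.

```python
import os
from typing import Any, Dict


def loader_kwargs_from_env(
    file_path: str, env_var: str = "PDF_PASSWORD"
) -> Dict[str, Any]:
    """Build loader keyword arguments, adding `password` only if it is set."""
    kwargs: Dict[str, Any] = {"file_path": file_path}
    password = os.environ.get(env_var)
    if password:
        kwargs["password"] = password
    return kwargs


kwargs = loader_kwargs_from_env("encrypted.pdf")
# Print only the keys so the password never ends up in logs.
print(sorted(kwargs))
```

The resulting dictionary can then be unpacked into the loader with `OpenDataLoaderPDFLoader(**kwargs)`.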

### Image handling

```python theme={"theme":{"light":"catppuccin-latte","dark":"catppuccin-mocha"}}
from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

# Images are excluded by default (image_output="off")
# This is optimal for text-based RAG pipelines

# Embed images as Base64 (for multimodal RAG)
loader = OpenDataLoaderPDFLoader(
    file_path="doc.pdf",
    format="markdown",
    image_output="embedded",
    image_format="jpeg"  # or "png"
)
```
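
With `image_output="embedded"`, Base64 payloads can make `page_content` very large, so a pipeline may want to pull the images out before chunking the text. The sketch below assumes that embedded images appear in the Markdown as standard data-URI image links of the form `![alt](data:image/png;base64,...)`. That output shape is an assumption and should be checked against real loader output. Both helper functions are hypothetical.

```python
import base64
import re
from typing import List, Tuple

# Assumed shape of an embedded image: ![alt](data:image/png;base64,....)
DATA_URI_IMAGE = re.compile(
    r"!\[[^\]]*\]\(data:image/(?P<fmt>png|jpeg);base64,(?P<data>[A-Za-z0-9+/=]+)\)"
)


def extract_embedded_images(markdown: str) -> List[Tuple[str, bytes]]:
    """Return (format, raw_bytes) for each Base64 image found in `markdown`."""
    return [
        (m.group("fmt"), base64.b64decode(m.group("data")))
        for m in DATA_URI_IMAGE.finditer(markdown)
    ]


def strip_embedded_images(markdown: str) -> str:
    """Replace each embedded image with a short placeholder."""
    return DATA_URI_IMAGE.sub("[image]", markdown)


sample = "Figure 1\n\n![chart](data:image/png;base64,aGVsbG8=)\n\nCaption text"
images = extract_embedded_images(sample)
print(images)  # [('png', b'hello')]
print(strip_embedded_images(sample))
```

The decoded bytes could be sent to a multimodal model while the stripped text goes to the text embedder.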

## Document metadata

Each returned `Document` includes metadata such as the source file, the output format, and the page number:

```python theme={"theme":{"light":"catppuccin-latte","dark":"catppuccin-mocha"}}
doc.metadata
# {'source': 'document.pdf', 'format': 'text', 'page': 1}
```
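
The `source` and `page` keys make it straightforward to regroup per-page documents by file. The sketch below assumes the metadata keys shown above. `group_pages_by_source` is a hypothetical helper, and `SimpleNamespace` objects stand in for real `Document` instances so the snippet runs without the package installed.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def group_pages_by_source(documents: Iterable[Any]) -> Dict[str, List[Any]]:
    """Group documents by `metadata['source']`, sorted by page within a file."""
    grouped: Dict[str, List[Any]] = defaultdict(list)
    for doc in documents:
        grouped[doc.metadata["source"]].append(doc)
    for pages in grouped.values():
        # Fall back to 0 in case a document carries no `page` key.
        pages.sort(key=lambda d: d.metadata.get("page", 0))
    return dict(grouped)


# Stand-ins for Document objects (illustration only).
docs = [
    SimpleNamespace(page_content="B", metadata={"source": "a.pdf", "format": "text", "page": 2}),
    SimpleNamespace(page_content="A", metadata={"source": "a.pdf", "format": "text", "page": 1}),
    SimpleNamespace(page_content="C", metadata={"source": "b.pdf", "format": "text", "page": 1}),
]
grouped = group_pages_by_source(docs)
for source, pages in grouped.items():
    print(source, [p.metadata["page"] for p in pages])
# a.pdf [1, 2]
# b.pdf [1]
```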

## Additional resources

* [LangChain OpenDataLoader PDF integration GitHub](https://github.com/opendataloader-project/langchain-opendataloader-pdf)
* [LangChain OpenDataLoader PDF integration PyPI package](https://pypi.org/project/langchain-opendataloader-pdf/)
* [OpenDataLoader PDF GitHub](https://github.com/opendataloader-project/opendataloader-pdf)
* [OpenDataLoader PDF Homepage](https://opendataloader.org/)

