PaddleOCR is a powerful and lightweight OCR toolkit developed by Baidu that connects images and PDFs with LLMs. It supports over 100 languages and transforms document content into structured, AI-ready data. This integration provides PaddleOCR’s large-model document parsing capabilities via the PaddleOCRVLLoader document loader.

Overview

Integration details

Class | Package | Local | Serializable | JS support
PaddleOCRVLLoader | langchain-paddleocr | | |

Loader features

Source | Document Lazy Loading | Native Async Support
PaddleOCRVLLoader | |
The PaddleOCRVLLoader enables you to:
  • Extract text and layout information from PDF and image files using models from Baidu’s PaddleOCR-VL series (e.g., PaddleOCR-VL, PaddleOCR-VL-1.5)
  • Process documents from local files or remote URLs

Prerequisites

To use the PaddleOCR-VL loader, you need:
  1. API Access: Access to a PaddleOCR-VL API endpoint
  2. Authentication: An access token for the API (provided directly or via the PADDLEOCR_ACCESS_TOKEN environment variable)
Both the API URL and the access token are available on the PaddleOCR official website. Simply click the API button and copy the URL and token from the API invocation example provided there.
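
Before constructing a loader, it can help to confirm that a token is actually available. The following is a minimal sketch using only the standard library; resolve_access_token is a hypothetical helper for illustration and is not part of langchain-paddleocr:

```python
import os
from typing import Mapping, Optional


def resolve_access_token(
    explicit: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return the explicit token if given, else the environment value.

    Hypothetical helper: it only mirrors the two options described above
    (direct value or PADDLEOCR_ACCESS_TOKEN) and fails early if neither is set.
    """
    token = explicit or env.get("PADDLEOCR_ACCESS_TOKEN", "")
    if not token.strip():
        raise RuntimeError(
            "No access token found: pass one explicitly or set "
            "PADDLEOCR_ACCESS_TOKEN."
        )
    return token.strip()
```

Failing early with a clear message avoids sending a request that the service would reject anyway.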

Setup

pip install langchain-paddleocr

Initialization

Basic initialization requires the API endpoint URL and a file path:
from langchain_paddleocr import PaddleOCRVLLoader
from pydantic import SecretStr

loader = PaddleOCRVLLoader(
    file_path="path/to/document.pdf",
    api_url="your-api-endpoint",
    access_token=SecretStr("your-access-token")  # Optional if using environment variable
)
For authentication via environment variable:
export PADDLEOCR_ACCESS_TOKEN="your-access-token"
Then initialize without the access_token parameter:
loader = PaddleOCRVLLoader(
    file_path="path/to/document.pdf",
    api_url="your-api-endpoint"
)

Advanced Configuration

The loader accepts many optional parameters for tuning document processing. Parameters set to None use the service default:
loader = PaddleOCRVLLoader(
    file_path=["document1.pdf", "document2.jpg"],  # Multiple files
    api_url="your-api-endpoint",

    access_token=None,  # Optional: SecretStr for API authentication
    file_type=None,  # Optional: "pdf" or "image"; None auto-detects each file (needed here for mixed types)

    use_doc_orientation_classify=False,  # Enable document orientation classification
    use_doc_unwarping=False,  # Enable document unwarping
    use_layout_detection=None,  # Enable layout detection (None = use service default)
    use_chart_recognition=None,  # Enable chart recognition (None = use service default)
    use_seal_recognition=None,  # Enable seal recognition (None = use service default)
    use_ocr_for_image_block=None,  # Run OCR on image blocks (None = use service default)

    layout_threshold=None,  # Detection threshold (None = use service default)
    layout_nms=None,  # Apply non-maximum suppression (None = use service default)
    layout_unclip_ratio=None,  # Layout unclip ratio (None = use service default)
    layout_merge_bboxes_mode=None,  # Mode for merging layout bounding boxes (None = use service default)
    layout_shape_mode=None,  # Layout shape mode (None = use service default)

    prompt_label=None,  # Prompt label for VLM (None = use service default)
    format_block_content=None,  # Format block content (None = use service default)
    repetition_penalty=None,  # Repetition penalty for VLM sampling (None = use service default)
    temperature=None,  # Temperature for VLM sampling (None = use service default)
    top_p=None,  # Top-p sampling value for VLM (None = use service default)
    min_pixels=None,  # Minimum pixels allowed in preprocessing (None = use service default)
    max_pixels=None,  # Maximum pixels allowed in preprocessing (None = use service default)
    max_new_tokens=None,  # Maximum tokens generated by VLM (None = use service default)

    merge_layout_blocks=None,  # Merge layout blocks across columns (None = use service default)
    markdown_ignore_labels=None,  # Layout labels to ignore in Markdown (None = use service default)
    vlm_extra_args=None,  # Additional VLM configuration parameters (None = use service default)

    prettify_markdown=None,  # Prettify Markdown output (None = use service default)
    show_formula_number=None,  # Include formula numbers in Markdown (None = use service default)
    restructure_pages=None,  # Restructure results across pages (None = use service default)
    merge_tables=None,  # Merge tables across pages (None = use service default)
    relevel_titles=None,  # Relevel titles (None = use service default)
    visualize=None,  # Include visualization results (None = use service default)

    additional_params=None,  # Additional API parameters
    timeout=300,  # Request timeout in seconds
)

Basic Usage

Loading Documents

# Load a single document
loader = PaddleOCRVLLoader(
    file_path="https://arxiv.org/pdf/2408.09869",
    api_url="your-api-endpoint"
)
docs = loader.load()

# Inspect the results
for doc in docs[:2]:
    print(f"Content: {doc.page_content[:200]}...")
    print(f"Source: {doc.metadata['source']}")
    print("---")

Handling Multiple File Types

The loader automatically detects file types based on extensions:
# Mixed file types - auto-detected
files = [
    "document.pdf",      # PDF file
    "image.jpg",         # Image file
    "https://example.com/report.pdf"  # Remote PDF
]

loader = PaddleOCRVLLoader(file_path=files, api_url="your-api-endpoint")
Supported image formats: .jpg, .jpeg, .png, .bmp, .tiff, .tif, .webp
Supported document formats: .pdf
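
To make the extension rule concrete, here is a rough sketch of how extension-based detection could work, built from the format lists above. guess_file_type is a hypothetical function and is not the loader's actual implementation, which may handle more cases (for example, URLs without an extension):

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Extension lists copied from the supported formats above.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif", ".webp"}
DOCUMENT_EXTENSIONS = {".pdf"}


def guess_file_type(path_or_url: str) -> Optional[str]:
    """Return "pdf", "image", or None when the extension is not recognized."""
    parsed = urlparse(path_or_url)
    # For URLs, look only at the path so query strings are ignored.
    path = parsed.path if parsed.scheme in ("http", "https") else path_or_url
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in DOCUMENT_EXTENSIONS:
        return "pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    return None
```

When a source has no recognizable extension, this sketch returns None; in that situation, setting file_type explicitly on the loader is the safer choice.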

Advanced Features

Accessing Raw API Responses

The loader includes the complete API response in document metadata:
docs = loader.load()
first_doc = docs[0]

# Access raw API response for advanced processing
raw_response = first_doc.metadata["paddleocr_vl_raw_response"]
print(f"Layout results: {len(raw_response['result']['layoutParsingResults'])}")
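
Indexing the raw response directly raises KeyError if a key is missing. A defensive variant might look like the sketch below. The key names result and layoutParsingResults come from the example above; layout_results is a hypothetical helper, and the sample payload is a stand-in, not a real API response:

```python
from typing import Any, List, Mapping


def layout_results(raw_response: Mapping[str, Any]) -> List[Any]:
    """Return result.layoutParsingResults, or an empty list if absent."""
    result = raw_response.get("result") or {}
    items = result.get("layoutParsingResults") or []
    return list(items)


# Stand-in payload for illustration; real responses carry more fields.
sample = {"result": {"layoutParsingResults": [{"placeholder": 1}, {"placeholder": 2}]}}
print(len(layout_results(sample)))  # 2
print(len(layout_results({})))      # 0
```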

Error Handling

The loader provides detailed error messages for troubleshooting:
try:
    docs = loader.load()
except ValueError as e:
    print(f"Processing failed: {e}")
    # Common issues: invalid API endpoint, authentication errors, unsupported file types
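
If one failing file should not abort a whole batch, each source can be loaded separately and the errors collected. The sketch below assumes, as in the example above, that failures surface as ValueError. load_each and its load_one argument are hypothetical names, and the loading function is passed in so the sketch stays independent of the loader itself:

```python
from typing import Callable, Dict, Iterable, List, Tuple


def load_each(
    sources: Iterable[str],
    load_one: Callable[[str], List],
) -> Tuple[List, Dict[str, str]]:
    """Load every source on its own; return (documents, errors by source)."""
    docs: List = []
    errors: Dict[str, str] = {}
    for source in sources:
        try:
            docs.extend(load_one(source))
        except ValueError as exc:
            # Record the failure and continue with the remaining sources.
            errors[source] = str(exc)
    return docs, errors
```

Here load_one could be a small function that builds a PaddleOCRVLLoader for a single file_path and returns the result of load().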

Best Practices

Error Handling

  • Network Timeouts: Set an appropriate timeout value (in seconds) for large documents
  • Authentication: Use environment variables for secure token management
  • File Validation: Verify file accessibility before processing
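
The file validation step can be sketched with the standard library alone. split_by_accessibility is a hypothetical helper; it checks only local paths and passes remote URLs through unchecked:

```python
import os
from typing import Iterable, List, Tuple


def split_by_accessibility(sources: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split sources into (usable, missing) before handing them to a loader."""
    usable: List[str] = []
    missing: List[str] = []
    for source in sources:
        if source.startswith(("http://", "https://")):
            usable.append(source)  # Remote URL: reachability is not checked here.
        elif os.path.isfile(source) and os.access(source, os.R_OK):
            usable.append(source)  # Local file exists and is readable.
        else:
            missing.append(source)
    return usable, missing
```

Only the usable list would then be passed as file_path, and the missing list can be reported to the user.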

Troubleshooting

Common Issues

  1. Authentication Errors: Ensure PADDLEOCR_ACCESS_TOKEN is set or access_token is provided
  2. File Type Errors: Verify file extensions and accessibility
  3. API Connection Issues: Check endpoint URL and network connectivity

Debug Mode

For detailed debugging, examine the raw API response:
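docs = loader.load()
if docs:
    raw_response = docs[0].metadata.get("paddleocr_vl_raw_response")
    # .get() returns None when the key is absent, so guard before calling .keys()
    if raw_response is not None:
        print("API response structure:", list(raw_response.keys()))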
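
For a deeper look than the top-level keys, a small recursive outline can help. The sketch below is a generic dictionary walker; describe is a hypothetical helper and makes no assumption about the response schema:

```python
from typing import Any


def describe(value: Any, max_depth: int = 2, indent: int = 0) -> str:
    """Return an indented outline of dict keys, list lengths, and value types."""
    pad = "  " * indent
    if isinstance(value, dict):
        if not value:
            return f"{pad}empty dict"
        if max_depth == 0:
            # Depth limit reached: summarize instead of descending further.
            return f"{pad}dict with {len(value)} keys"
        lines = []
        for key, child in value.items():
            lines.append(f"{pad}{key}:")
            lines.append(describe(child, max_depth - 1, indent + 1))
        return "\n".join(lines)
    if isinstance(value, list):
        return f"{pad}list of {len(value)} items"
    return f"{pad}{type(value).__name__}"
```

Calling print(describe(raw_response)) prints the nesting without dumping the full payload, which can be large for multi-page documents.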

API Reference