PaddleOCR is a powerful and lightweight OCR toolkit developed by Baidu that connects images and PDFs with LLMs. It supports over 100 languages and transforms document content into structured, AI-ready data.
This integration provides PaddleOCR’s large-model document parsing capabilities via the PaddleOCRVLLoader document loader.
Overview
Integration details
| Class | Package | Local | Serializable | JS support |
|---|---|---|---|---|
| PaddleOCRVLLoader | langchain-paddleocr | ✅ | ❌ | ❌ |
Loader features
| Source | Document Lazy Loading | Native Async Support |
|---|---|---|
| PaddleOCRVLLoader | ✅ | ❌ |
The PaddleOCRVLLoader enables you to:
- Extract text and layout information from PDF and image files using models from Baidu’s PaddleOCR-VL series (e.g., PaddleOCR-VL, PaddleOCR-VL-1.5)
- Process documents from local files or remote URLs
Prerequisites
To use the PaddleOCR-VL loader, you need:
- API Access: Access to a PaddleOCR-VL API endpoint
- Authentication: An access token for the API (can be provided directly or via the PADDLEOCR_ACCESS_TOKEN environment variable)
Both the API URL and the access token are available on the PaddleOCR official website. Simply click the API button and copy the URL and token from the API invocation example provided there.
Setup
pip install langchain-paddleocr
Initialization
Basic initialization requires the API endpoint URL and file path:
from langchain_paddleocr import PaddleOCRVLLoader
from pydantic import SecretStr

loader = PaddleOCRVLLoader(
    file_path="path/to/document.pdf",
    api_url="your-api-endpoint",
    access_token=SecretStr("your-access-token"),  # Optional if using environment variable
)
For authentication via environment variable:
export PADDLEOCR_ACCESS_TOKEN="your-access-token"
Then initialize without the access_token parameter:
loader = PaddleOCRVLLoader(
    file_path="path/to/document.pdf",
    api_url="your-api-endpoint",
)
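The two ways of supplying the token can be reconciled in application code before the loader is constructed. The sketch below is illustrative only: `resolve_access_token` is a hypothetical helper, not part of langchain-paddleocr, and it assumes that an explicitly passed token should take precedence over PADDLEOCR_ACCESS_TOKEN.

```python
import os
from typing import Optional


def resolve_access_token(explicit: Optional[str] = None) -> Optional[str]:
    """Return the explicit token if given, else PADDLEOCR_ACCESS_TOKEN, else None.

    Hypothetical helper for illustration; not part of langchain-paddleocr.
    """
    if explicit:
        return explicit
    # Treat an empty environment variable the same as an unset one.
    return os.environ.get("PADDLEOCR_ACCESS_TOKEN") or None
```

A returned value of None can be used to fail early with a clear message before any request is sent.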
Advanced Configuration
The loader supports numerous configuration options for fine-tuning the document processing:
loader = PaddleOCRVLLoader(
    file_path=["document1.pdf", "document2.jpg"],  # Multiple files
    api_url="your-api-endpoint",
    access_token=None,  # Optional: SecretStr for API authentication
    file_type=None,  # Optional: "pdf" or "image", or None for auto-detection
    use_doc_orientation_classify=False,  # Enable document orientation classification
    use_doc_unwarping=False,  # Enable document unwarping
    use_layout_detection=None,  # Enable layout detection (None = use service default)
    use_chart_recognition=None,  # Enable chart recognition (None = use service default)
    use_seal_recognition=None,  # Enable seal recognition (None = use service default)
    use_ocr_for_image_block=None,  # Run OCR on image blocks (None = use service default)
    layout_threshold=None,  # Detection threshold (None = use service default)
    layout_nms=None,  # Apply non-maximum suppression (None = use service default)
    layout_unclip_ratio=None,  # Layout unclip ratio (None = use service default)
    layout_merge_bboxes_mode=None,  # Mode for merging layout bounding boxes (None = use service default)
    layout_shape_mode=None,  # Layout shape mode (None = use service default)
    prompt_label=None,  # Prompt label for VLM (None = use service default)
    format_block_content=None,  # Format block content (None = use service default)
    repetition_penalty=None,  # Repetition penalty for VLM sampling (None = use service default)
    temperature=None,  # Temperature for VLM sampling (None = use service default)
    top_p=None,  # Top-p sampling value for VLM (None = use service default)
    min_pixels=None,  # Minimum pixels allowed in preprocessing (None = use service default)
    max_pixels=None,  # Maximum pixels allowed in preprocessing (None = use service default)
    max_new_tokens=None,  # Maximum tokens generated by VLM (None = use service default)
    merge_layout_blocks=None,  # Merge layout blocks across columns (None = use service default)
    markdown_ignore_labels=None,  # Layout labels to ignore in Markdown (None = use service default)
    vlm_extra_args=None,  # Additional VLM configuration parameters (None = use service default)
    prettify_markdown=None,  # Prettify Markdown output (None = use service default)
    show_formula_number=None,  # Include formula numbers in Markdown (None = use service default)
    restructure_pages=None,  # Restructure results across pages (None = use service default)
    merge_tables=None,  # Merge tables across pages (None = use service default)
    relevel_titles=None,  # Relevel titles (None = use service default)
    visualize=None,  # Include visualization results (None = use service default)
    additional_params=None,  # Additional API parameters
    timeout=300,  # Request timeout in seconds
)
Basic Usage
Loading Documents
# Load a single document
loader = PaddleOCRVLLoader(
    file_path="https://arxiv.org/pdf/2408.09869",
    api_url="your-api-endpoint",
)
docs = loader.load()

# Inspect the results
for doc in docs[:2]:
    print(f"Content: {doc.page_content[:200]}...")
    print(f"Source: {doc.metadata['source']}")
    print("---")
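Because the loader is listed as supporting lazy loading, documents can also be consumed incrementally rather than all at once. The following is a minimal sketch, assuming the loader exposes the standard LangChain `lazy_load()` iterator; `take_documents` is a hypothetical helper introduced here for illustration.

```python
from itertools import islice
from typing import Any, Iterable, List


def take_documents(lazy_docs: Iterable[Any], limit: int) -> List[Any]:
    """Consume at most `limit` items from a lazy document iterator.

    Hypothetical helper: items beyond `limit` are never requested
    from the underlying iterator.
    """
    return list(islice(lazy_docs, limit))
```

With a configured loader, this might be called as `take_documents(loader.lazy_load(), 2)`. How much work is deferred per item depends on how the loader batches its API requests, which this sketch does not assume.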
Handling Multiple File Types
The loader automatically detects file types based on extensions:
# Mixed file types - auto-detected
files = [
    "document.pdf",  # PDF file
    "image.jpg",  # Image file
    "https://example.com/report.pdf",  # Remote PDF
]
loader = PaddleOCRVLLoader(file_path=files, api_url="your-api-endpoint")
Supported image formats: .jpg, .jpeg, .png, .bmp, .tiff, .tif, .webp
Supported document formats: .pdf
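To check ahead of time how a path would be classified by extension, the lists above can be mirrored in a small function. This is a sketch of extension-based classification only, not the loader's actual detection logic; `guess_file_type` and the two constant names are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Extension lists copied from the supported formats above.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif", ".webp"}
PDF_EXTENSIONS = {".pdf"}


def guess_file_type(path_or_url: str) -> Optional[str]:
    """Return "pdf", "image", or None when the extension is not recognized."""
    parsed = urlparse(path_or_url)
    # For URLs, look only at the path component so query strings are ignored.
    path = parsed.path if parsed.scheme in ("http", "https") else path_or_url
    suffix = PurePosixPath(path.replace("\\", "/")).suffix.lower()
    if suffix in PDF_EXTENSIONS:
        return "pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    return None
```

A URL without a recognized extension, such as https://arxiv.org/pdf/2408.09869, yields None in this sketch. In such cases it may be safer to pass file_type explicitly rather than rely on auto-detection.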
Advanced Features
Accessing Raw API Responses
The loader includes the complete API response in document metadata:
docs = loader.load()
first_doc = docs[0]
# Access raw API response for advanced processing
raw_response = first_doc.metadata["paddleocr_vl_raw_response"]
print(f"Layout results: {len(raw_response['result']['layoutParsingResults'])}")
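Direct indexing as above raises KeyError if the response lacks the expected keys. A more defensive variant is sketched below, assuming only the `result` and `layoutParsingResults` keys shown in the example; `layout_results` is a hypothetical helper.

```python
from typing import Any, Dict, List


def layout_results(raw_response: Dict[str, Any]) -> List[Any]:
    """Return result.layoutParsingResults, or an empty list if the keys are missing."""
    result = raw_response.get("result") or {}
    return list(result.get("layoutParsingResults") or [])
```

This keeps downstream code working on an empty list when a response is incomplete, at the cost of hiding the difference between "no results" and "malformed response".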
Error Handling
The loader provides detailed error messages for troubleshooting:
try:
    docs = loader.load()
except ValueError as e:
    print(f"Processing failed: {e}")
    # Common issues: invalid API endpoint, authentication errors, unsupported file types
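When several files are processed, it can be useful to keep one failure from discarding the rest. The sketch below loads each path separately and collects errors per path. `load_each` is a hypothetical helper, and catching ValueError simply follows the example above rather than a documented list of exception types.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def load_each(
    paths: Iterable[str],
    load_one: Callable[[str], List[Any]],
) -> Tuple[List[Any], Dict[str, str]]:
    """Load each path separately, collecting documents and per-path errors."""
    docs: List[Any] = []
    errors: Dict[str, str] = {}
    for path in paths:
        try:
            docs.extend(load_one(path))
        except ValueError as exc:  # Same exception type as the example above.
            errors[path] = str(exc)
    return docs, errors
```

With the loader, `load_one` might be `lambda p: PaddleOCRVLLoader(file_path=p, api_url="your-api-endpoint").load()`. The trade-off is one loader instance per file instead of a single multi-file loader.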
Best Practices
Error Handling
- Network Timeouts: Set an appropriate timeout value for large documents
- Authentication: Use environment variables for secure token management
- File Validation: Verify file accessibility before processing
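The file validation step can be sketched as a pre-flight check on local paths. `find_unreadable` is a hypothetical helper; it assumes that anything starting with http:// or https:// is a remote file and leaves it for the API request to validate.

```python
import os
from typing import Iterable, List


def find_unreadable(paths: Iterable[str]) -> List[str]:
    """Return the local paths that are missing or not readable; URLs are skipped."""
    problems = []
    for path in paths:
        if path.startswith(("http://", "https://")):
            continue  # Remote files are left for the API request to validate.
        if not (os.path.isfile(path) and os.access(path, os.R_OK)):
            problems.append(path)
    return problems
```

An empty return value means every local path exists and is readable at the time of the check. It does not guarantee that the file is a valid PDF or image.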
Troubleshooting
Common Issues
- Authentication Errors: Ensure PADDLEOCR_ACCESS_TOKEN is set or access_token is provided
- File Type Errors: Verify file extensions and accessibility
- API Connection Issues: Check endpoint URL and network connectivity
Debug Mode
For detailed debugging, examine the raw API response:
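docs = loader.load()
if docs:
    raw_response = docs[0].metadata.get("paddleocr_vl_raw_response")
    # .get() returns None when the key is absent, so guard before calling .keys()
    if raw_response is not None:
        print("API Response structure:", list(raw_response.keys()))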
API Reference