`ZeroxPDFLoader` is a document loader that leverages the Zerox library. Zerox converts PDF documents into images, processes them using a vision-capable language model, and generates a structured Markdown representation. This loader allows for asynchronous operations and provides page-level document extraction.
| Class | Package | Local | Serializable | JS support |
|---|---|---|---|---|
| ZeroxPDFLoader | langchain_community | ❌ | ❌ | ❌ |

| Source | Document Lazy Loading | Native Async Support |
|---|---|---|
| ZeroxPDFLoader | ✅ | ❌ |
To use `ZeroxPDFLoader`, you need to install the `zerox` package. Also make sure `langchain-community` is installed.
`ZeroxPDFLoader` enables PDF text extraction using vision-capable language models by converting each page into an image and processing it asynchronously. To use this loader, you need to specify a model and configure any necessary environment variables for Zerox, such as API keys.
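A minimal sketch of that setup follows. It assumes the loader is importable from `langchain_community.document_loaders.pdf` and that your chosen model reads its key from an `OPENAI_API_KEY` environment variable; the helper name `load_pdf_with_zerox` and the example file path are hypothetical.

```python
from pathlib import Path
from typing import Any, List, Union


def load_pdf_with_zerox(
    file_path: Union[str, Path],
    model: str = "gpt-4o-mini",
    **zerox_kwargs: Any,
) -> List[Any]:
    """Load a PDF page by page through ZeroxPDFLoader (hypothetical helper)."""
    # Imported lazily so this module can be imported without langchain-community.
    # Assumption: this is the import path for the loader.
    from langchain_community.document_loaders.pdf import ZeroxPDFLoader

    loader = ZeroxPDFLoader(file_path=file_path, model=model, **zerox_kwargs)
    # load() collects every page Document into a list.
    return loader.load()


# Example call (needs network access and a valid API key in the environment):
#   import os
#   os.environ["OPENAI_API_KEY"] = "<your key>"   # assumed variable name
#   docs = load_pdf_with_zerox("example.pdf")      # hypothetical file
#   print(docs[0].page_content)
```

Keeping the import inside the function is only a convenience for environments where the optional dependency may be missing; a top-level import works just as well.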
If you’re working in an environment like Jupyter Notebook, you may need to handle asynchronous code by using `nest_asyncio`. You can set this up by calling `nest_asyncio.apply()` before using the loader.
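The `nest_asyncio` setup described above might look like the following sketch. It assumes the third-party `nest_asyncio` package is installed; the `ImportError` guard and the `NEST_ASYNCIO_APPLIED` flag are illustrative additions and not part of the loader.

```python
# Allow nested event loops, e.g. in a Jupyter notebook where a loop is already running.
try:
    import nest_asyncio  # third-party package, assumed installed

    nest_asyncio.apply()  # patch asyncio so a running loop can be re-entered
    NEST_ASYNCIO_APPLIED = True
except ImportError:
    # Outside notebooks the patch is usually unnecessary, so a missing package is not fatal.
    NEST_ASYNCIO_APPLIED = False
```

Run this once, before creating the loader, so the patch is in place when pages are processed.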
The `.load()` method is equivalent to `.lazy_load()`, with the resulting documents collected into a list.
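To illustrate the lazy path, here is a small sketch that consumes pages one at a time instead of building the full list. The helper `iter_page_text` is hypothetical, and the `"page"` metadata key is an assumption about how the loader labels page numbers; it works with any object exposing `lazy_load()`.

```python
from typing import Any, Iterator, Tuple


def iter_page_text(loader: Any) -> Iterator[Tuple[Any, str]]:
    """Yield (page_number, text) pairs one page at a time (hypothetical helper)."""
    for doc in loader.lazy_load():
        # Assumption: the page number is stored under the "page" metadata key.
        yield doc.metadata.get("page"), doc.page_content


# Example call, given an already constructed loader:
#   for page_number, text in iter_page_text(loader):
#       print(page_number, len(text))
```

Iterating this way lets you stop early or handle each page as soon as it is yielded.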
`ZeroxPDFLoader` accepts `zerox_kwargs` for handling Zerox-specific parameters.
Arguments:

- `file_path` (Union[str, Path]): Path to the PDF file.
- `model` (str): Vision-capable model to use for processing, in the format `<provider>/<model>`. Defaults to `"gpt-4o-mini"`. Some examples of valid values are:
  - `"gpt-4o-mini"` (OpenAI model)
  - `"azure/gpt-4o-mini"`
  - `"gemini/gpt-4o-mini"`
  - `"claude-3-opus-20240229"`
  - `"vertex_ai/gemini-1.5-flash-001"`
- `**zerox_kwargs` (dict): Additional Zerox-specific parameters such as API key, endpoint, etc.
Methods:

- `lazy_load`: Generates an iterator of `Document` instances, each representing a page of the PDF, along with metadata including page number and source.

Notes:

- Make sure to set required environment variables, such as `API_KEY` or endpoint details, as specified in the Zerox documentation.
- If you run into event-loop errors in a notebook, apply `nest_asyncio` as shown in the setup section.
- Use `nest_asyncio.apply()` to prevent asynchronous loop conflicts in environments like Jupyter.
- Verify that the `zerox_kwargs` match the expected arguments for your chosen model and that all necessary environment variables are set.
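One way to check the environment before constructing the loader is sketched below. The helper `missing_env_vars` is hypothetical, and the variable names you pass in depend on your provider; `OPENAI_API_KEY` in the example is an assumption, so consult the Zerox documentation for the names your model actually needs.

```python
import os
from typing import Iterable, List, Mapping


def missing_env_vars(
    required: Iterable[str],
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Example call (the variable name is an assumption for an OpenAI-hosted model):
#   missing = missing_env_vars(["OPENAI_API_KEY"])
#   if missing:
#       raise RuntimeError(f"Set these environment variables first: {missing}")
```

Failing early with the missing names tends to be easier to debug than an authentication error raised partway through page processing.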