| Class | Package | Local | Serializable | JS support |
|---|---|---|---|---|
| PyMuPDF4LLMLoader | langchain-pymupdf4llm | ✅ | ❌ | ❌ |

| Source | Document Lazy Loading | Native Async Support | Extract Images | Extract Tables |
|---|---|---|---|---|
| PyMuPDF4LLMLoader | ✅ | ❌ | ✅ | ✅ |
To use `PyMuPDF4LLMLoader`, install the `langchain-pymupdf4llm` integration package.
By default the PDF is split by page, and each document's metadata includes `page` (page number). In some cases, however, we may want to process the PDF as a single text flow, so that paragraphs are not cut in half at page boundaries. In that case you can use the `single` mode:
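To make the difference between the two modes concrete, here is a minimal pure-Python sketch. It does not call the loader: `to_documents` is a hypothetical helper that only imitates the behaviour described here, and the 0-based page numbering and plain-dict documents are assumptions made for illustration.

```python
def to_documents(pages, mode="page", delimiter="\n"):
    """Imitate per-page and single-flow output from a list of page texts.

    Hypothetical helper for illustration only; it is not the loader's code.
    """
    if mode == "page":
        # One document per page; each keeps its own page number
        # (numbered from 0 here, which is an assumption).
        return [
            {"text": text, "metadata": {"page": number}}
            for number, text in enumerate(pages)
        ]
    if mode == "single":
        # One document for the whole file; no single page number applies.
        return [{"text": delimiter.join(pages), "metadata": {}}]
    raise ValueError(f"unknown mode: {mode!r}")


# A paragraph that runs across a page boundary.
pages = ["A paragraph that starts on one page", "and finishes on the next."]

per_page = to_documents(pages, mode="page")
single = to_documents(pages, mode="single", delimiter=" ")

print(len(per_page), len(single))
print(single[0]["text"])
```

In page mode the paragraph ends up in two separate documents, while in single mode it stays in one piece, at the cost of losing the per-page metadata.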
In this mode the `page` (page_number) metadata disappears. Here's how to clearly identify where pages end in the text flow:
The default `pages_delimiter` is `\n-----\n\n`.
It could also simply be `\n`, or `\f` to clearly indicate a page change, or `<!-- PAGE BREAK -->` for seamless injection in a Markdown viewer without a visual effect.
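The choice of delimiter matters if you later want to recover page boundaries from the single text flow. The sketch below is plain Python and does not use the loader; `join_pages` and `split_pages` are hypothetical helpers, and it assumes the delimiter is placed between consecutive pages.

```python
DEFAULT_DELIMITER = "\n-----\n\n"  # the default value quoted above
FORM_FEED = "\f"                   # a single character that marks a page change


def join_pages(pages, delimiter=DEFAULT_DELIMITER):
    """Concatenate page texts into one flow (assumes delimiter goes between pages)."""
    return delimiter.join(pages)


def split_pages(text, delimiter=DEFAULT_DELIMITER):
    """Recover page texts from a flow produced with the same delimiter."""
    return text.split(delimiter)


pages = ["First page text.", "Second page text.", "Third page text."]

# A distinctive delimiter round-trips cleanly.
flow = join_pages(pages, FORM_FEED)
recovered = split_pages(flow, FORM_FEED)

# A bare newline is ambiguous as soon as a page contains newlines itself.
ambiguous = split_pages(join_pages(["line 1\nline 2", "line 3"], "\n"), "\n")

print(recovered == pages)
print(len(ambiguous))
```

A delimiter that cannot occur inside page text, such as `\f` or an HTML comment, keeps the page boundaries recoverable; a bare `\n` reads cleanly but cannot be told apart from ordinary line breaks.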
For example, you can use `open` to read the binary content of either a PDF or a Markdown file, but you need different parsing logic to convert that binary data into text.
As a result, it can be helpful to decouple the parsing logic from the loading logic, which makes it easier to re-use a given parser regardless of how the data was loaded.
You can use this strategy to analyze different files with the same parsing parameters.
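The separation of loading from parsing can be sketched with the standard library alone. Everything named here (`Blob`, `load_from_memory`, `load_from_disk`, `make_text_parser`) is hypothetical and chosen for illustration; it is not the integration's API, and a real PDF parser would replace the simple text decoding.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator


@dataclass
class Blob:
    """Raw bytes plus a label for where they came from (hypothetical type)."""
    data: bytes
    source: str


def load_from_memory(items: Dict[str, bytes]) -> Iterator[Blob]:
    """Loading step: yield blobs from an in-memory mapping."""
    for source, data in items.items():
        yield Blob(data=data, source=source)


def load_from_disk(paths: Iterable[str]) -> Iterator[Blob]:
    """Loading step: yield blobs read from files in binary mode."""
    for path in paths:
        with open(path, "rb") as handle:
            yield Blob(data=handle.read(), source=path)


def make_text_parser(encoding: str = "utf-8", strip: bool = True) -> Callable[[Blob], dict]:
    """Parsing step: build one parser whose parameters are fixed once."""
    def parse(blob: Blob) -> dict:
        text = blob.data.decode(encoding)
        if strip:
            text = text.strip()
        return {"text": text, "metadata": {"source": blob.source}}
    return parse


# The same parser, with the same parameters, handles every blob,
# regardless of which loading function produced it.
parser = make_text_parser()
blobs = load_from_memory({"a.md": b"# Title\n", "b.md": b"Body text\n"})
docs = [parser(blob) for blob in blobs]

print([doc["metadata"]["source"] for doc in docs])
```

Because the parser only sees `Blob` objects, swapping `load_from_memory` for `load_from_disk` requires no change to the parsing code or its parameters.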