This notebook provides a quick overview for getting started with the UnDatasIO document loader. UnDatasIO enables efficient loading and parsing of various document formats including PDF, PNG, JPG, JPEG, and JFIF, with features like document lazy loading and native async support, all through UnDatasIO’s secure cloud API. These capabilities make the processed data ready for generative AI workflows like RAG. For detailed documentation on all features and configurations, refer to the official API reference.

Overview

Loader features

Source | Document Lazy Loading | Native Async Support
UnDatasIOLoader | Yes | Yes

Setup

Credentials

UnDatasIO requires an API token.
Generate a free token at undatas.io and set it in the cell below:
import getpass
import os

if "UNDATASIO_TOKEN" not in os.environ:
    os.environ["UNDATASIO_TOKEN"] = getpass.getpass(
        "Enter your UnDatasIO API token: "
    )
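In a script or scheduled job there is no terminal for `getpass` to prompt on, so it can help to fail fast with a clear message instead. The following is a minimal sketch of that idea; `require_token` is a hypothetical helper written for this example and is not part of the `langchain-undatasio` package.

```python
import os


def require_token(env=None, name="UNDATASIO_TOKEN"):
    """Return the API token from the environment, or raise a clear error.

    Hypothetical helper for illustration; not part of langchain-undatasio.
    """
    env = os.environ if env is None else env
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"{name} is not set; generate a token at undatas.io")
    return token
```

The returned value can then be passed as the `token` argument when the loader is constructed.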

Installation

Normal Installation

The following package is required to run the rest of this notebook.
# Install the UnDatasIO LangChain integration package
%pip install langchain-undatasio

Initialization

UnDatasIOLoader uploads a single file to the UnDatasIO cloud API and parses it there. Pass the API token and the path of the file to load:
from langchain_undatasio import UnDatasIOLoader

loader = UnDatasIOLoader(
    token=os.environ["UNDATASIO_TOKEN"],
    file_path="demo.pdf"
)
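Because the file is uploaded to a remote service, checking its extension locally can catch an unsupported file before any network call is made. The sketch below uses the formats listed in the introduction (PDF, PNG, JPG, JPEG, JFIF). The helper name `is_supported_file` is hypothetical, and treating extensions as case-insensitive is an assumption about how the service behaves.

```python
from pathlib import Path

# Formats listed in this notebook's introduction.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".jfif"}


def is_supported_file(file_path):
    """Return True if the file extension is one of the listed formats.

    Hypothetical pre-check; the comparison is case-insensitive by assumption.
    """
    return Path(file_path).suffix.lower() in SUPPORTED_SUFFIXES


for name in ["demo.pdf", "scan.JPG", "notes.docx"]:
    print(name, is_supported_file(name))
```

This only inspects the file name; it does not verify that the file content matches its extension.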

Load

docs = loader.load()
docs[0]
Document(
    metadata={'source': 'demo.pdf', 'task_id': 't1', 'file_id': 'f1'},
    page_content='Growing a Tail: Increasing Output Diversity in Large Language Models\n\nAuthors: Michal Shur-Ofry1, Bar Horowitz-Amsalem1†, Adir Rahamim2, Yonatan Belinkov2*\n\nAffiliations:\n\n1Law Faculty, Hebrew University of Jerusalem; Jerusalem, Israel.\n\n2Faculty of Computer Science, Technion – I'
)
print(docs[0].page_content[:300])
Growing a Tail: Increasing Output Diversity in Large Language Models

Authors: Michal Shur-Ofry1, Bar Horowitz-Amsalem1†, Adir Rahamim2, Yonatan Belinkov2*

Affiliations:

1Law Faculty, Hebrew University of Jerusalem; Jerusalem, Israel.

2Faculty of Computer Science, Technion – I

Lazy Load

UnDatasIOLoader supports lazy loading for memory-efficient iteration.
pages = []
for doc in loader.lazy_load():
    pages.append(doc)

pages[0]
Document(
    metadata={'source': 'demo.pdf', 'task_id': 't1', 'file_id': 'f1'},
    page_content='Growing a Tail: Increasing Output Diversity in Large Language Models\n\nAuthors: Michal Shur-Ofry1, Bar Horowitz-Amsalem1†, Adir Rahamim2, Yonatan Belinkov2*\n\nAffiliations:\n\n1Law Faculty, Hebrew University of Jerusalem; Jerusalem, Israel.\n\n2Faculty of Computer Science, Technion – I'
)
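The loader also lists native async support, which the cells above do not exercise. The sketch below assumes that `UnDatasIOLoader` follows the standard LangChain `BaseLoader` interface, in which `alazy_load()` is the async counterpart of `lazy_load()`; this has not been verified against the UnDatasIO package. `FakeLoader` is a hypothetical stand-in so the example runs without a token or network access.

```python
import asyncio


class FakeLoader:
    """Hypothetical stand-in exposing the async method name used by LangChain loaders."""

    async def alazy_load(self):
        for text in ["page one", "page two"]:
            await asyncio.sleep(0)  # yield control, as a real network call would
            yield text


async def collect(loader):
    """Gather every document produced by the loader's async iterator."""
    return [doc async for doc in loader.alazy_load()]


docs = asyncio.run(collect(FakeLoader()))
print(len(docs))
```

Inside a notebook, where an event loop is already running, `await collect(loader)` takes the place of `asyncio.run(...)`.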

See Also