This guide requires the @langchain/community and pdf-parse packages.
LangChain implements a Document abstraction, intended to represent a unit of text and associated metadata. It has three attributes:

- page_content: a string representing the content;
- metadata: a dict containing arbitrary metadata;
- id: (optional) a string identifier for the document.

The metadata attribute can capture information about the source of the document, its relationship to other documents, and other information. Note that an individual Document object often represents a chunk of a larger document.
We can generate sample documents when desired:
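As a minimal sketch, Document objects can be constructed directly; the documents and the "source" metadata key below are purely illustrative:

```python
from langchain_core.documents import Document

# Hand-built documents; the "source" metadata key is an arbitrary, illustrative choice.
documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]
```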
In practice, though, documents usually come from document loaders. Let's load a PDF into a sequence of Document objects. There is a sample PDF in the LangChain repo: a 10-k filing for Nike from 2023. We can consult the LangChain documentation for available PDF document loaders. Let's select PyPDFLoader, which is fairly lightweight.
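A sketch of the loading step, assuming the sample PDF has been saved locally (the file path below is an assumption):

```python
from langchain_community.document_loaders import PyPDFLoader

# Path to the sample 10-k filing; adjust to wherever the PDF lives locally.
file_path = "./example_data/nke-10k-2023.pdf"

loader = PyPDFLoader(file_path)
docs = loader.load()

print(len(docs))  # one Document per page of the PDF
```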
PyPDFLoader loads one Document object per PDF page. For each, we can easily access the string content of the page, as well as metadata containing the file name and page number.
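For example, inspecting the first page of the docs loaded above:

```python
# First 200 characters of the first page's text.
print(f"{docs[0].page_content[:200]}\n")

# Metadata includes the source file path and the page number.
print(docs[0].metadata)
```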
For retrieval and downstream question-answering, a page may be too coarse a representation. Our goal in the end will be to retrieve Document objects that answer an input query, and further splitting our PDF will help ensure that the meanings of relevant portions of the document are not “washed out” by surrounding text.
We can use text splitters for this purpose. Here we will use a simple text splitter that partitions based on characters, splitting our documents into chunks of 1000 characters with 200 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the RecursiveCharacterTextSplitter, which will recursively split the document using common separators (like new lines) until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.

We set add_start_index=True so that the character index where each split Document starts within the initial Document is preserved as the metadata attribute “start_index”.
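A sketch of this splitting step, continuing from the docs loaded above:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # target size of each chunk, in characters
    chunk_overlap=200,     # characters shared between consecutive chunks
    add_start_index=True,  # record each chunk's offset within the parent Document
)
all_splits = text_splitter.split_documents(docs)

print(len(all_splits))
```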
LangChain VectorStore objects contain methods for adding text and Document objects to the store, and for querying them using various similarity metrics. They are often initialized with embedding models, which determine how text data is translated to numeric vectors.
LangChain includes a suite of integrations with different vector store technologies. Some vector stores are hosted by a provider (e.g., various cloud providers) and require specific credentials to use; some (such as Postgres) run in separate infrastructure that can be hosted locally or via a third party; others can run in-memory for lightweight workloads. Let's select a vector store:
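One lightweight option is the in-memory vector store; the embedding model below (OpenAIEmbeddings) is an illustrative assumption, and any embeddings integration could be substituted:

```python
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings  # assumes OPENAI_API_KEY is set

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vector_store = InMemoryVectorStore(embeddings)

# Index the split documents computed earlier.
ids = vector_store.add_documents(documents=all_splits)
```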
Once we've instantiated a VectorStore that contains documents, we can query it. VectorStore includes methods for querying:

- synchronously and asynchronously;
- by string query and by vector;
- with and without returning similarity scores;
- by similarity and maximum marginal relevance (to balance similarity to the query against diversity in the retrieved results).
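For example, a similarity search by string query against the store built above, plus a variant that also returns scores:

```python
results = vector_store.similarity_search(
    "How many distribution centers does Nike have in the US?"
)
print(results[0])

# Variant that also returns similarity scores (score semantics depend on the store).
results_with_scores = vector_store.similarity_search_with_score(
    "What was Nike's revenue in 2023?"
)
doc, score = results_with_scores[0]
print(score)
```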
LangChain VectorStore objects do not subclass Runnable. LangChain Retrievers are Runnables, so they implement a standard set of methods (e.g., synchronous and asynchronous invoke and batch operations). Although we can construct retrievers from vector stores, retrievers can interface with non-vector store sources of data as well (such as external APIs).
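As a sketch of that idea, a retriever-like runnable can be built by wrapping the vector store's search method with the chain decorator from langchain_core (the choice of k=1 here is arbitrary):

```python
from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import chain


@chain
def retriever(query: str) -> List[Document]:
    # Return the single most similar chunk for each query.
    return vector_store.similarity_search(query, k=1)


retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ]
)
```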
Vector stores implement an as_retriever method that will generate a Retriever, specifically a VectorStoreRetriever. These retrievers include specific search_type and search_kwargs attributes that identify what methods of the underlying vector store to call, and how to parameterize them. For instance, we can replicate the retriever above with the following:
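A sketch reusing the vector_store from above:

```python
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},  # return the single most similar chunk per query
)

retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ]
)
```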
VectorStoreRetriever supports search types of "similarity" (default), "mmr" (maximum marginal relevance, described above), and "similarity_score_threshold". We can use the latter to threshold documents output by the retriever by similarity score.
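For instance, a sketch of a score-thresholded retriever; the 0.5 threshold is an arbitrary assumption, and relevance-score support depends on the underlying vector store:

```python
retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.5},  # arbitrary cutoff; tune per use case
)

retriever.invoke("When was Nike incorporated?")
```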
Retrievers can easily be incorporated into more complex applications, such as retrieval-augmented generation (RAG) applications that combine a given question with retrieved context into a prompt for an LLM. To learn more about building such an application, check out the RAG tutorial.