Concepts
This guide focuses on retrieval of text data. We will cover the following concepts:
- Documents and document loaders;
- Text splitters;
- Embeddings;
- Vector stores and retrievers.
Setup
Installation
This guide requires the @langchain/community and pdf-parse packages:
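For example, with npm (other package managers work similarly; @langchain/core is assumed as a peer dependency):

```bash
npm install @langchain/community @langchain/core pdf-parse
```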
LangSmith
Many of the applications you build with LangChain will contain multiple steps with multiple LLM calls. As these applications get more and more complex, it becomes crucial to be able to inspect what exactly is going on inside your chain or agent. The best way to do this is with LangSmith. After you sign up at the link above, make sure to set your environment variables to start logging traces:
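For example, in a bash shell (the variable names below follow current LangSmith conventions and may differ by version):

```bash
export LANGSMITH_TRACING="true"
export LANGSMITH_API_KEY="your-api-key"
```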
Documents and Document Loaders
LangChain implements a Document abstraction, which is intended to represent a unit of text and associated metadata. It has three attributes:
- pageContent: a string representing the content;
- metadata: an object containing arbitrary metadata;
- id: (optional) a string identifier for the document.
The metadata attribute can capture information about the source of the document, its relationship to other documents, and other information. Note that an individual Document object often represents a chunk of a larger document.
We can generate sample documents when desired:
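A minimal sketch using the Document class from @langchain/core; the sample content and metadata here are illustrative:

```typescript
import { Document } from "@langchain/core/documents";

const documents = [
  new Document({
    pageContent:
      "Dogs are great companions, known for their loyalty and friendliness.",
    metadata: { source: "mammal-pets-doc" },
  }),
  new Document({
    pageContent: "Cats are independent pets that often enjoy their own space.",
    metadata: { source: "mammal-pets-doc" },
  }),
];
```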
Loading documents
Let’s load a PDF into a sequence of Document objects. There is a sample PDF in the LangChain repo here — a 10-k filing for Nike from 2023. We can consult the LangChain documentation for available PDF document loaders.
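For example (the file path is a placeholder; point it at your local copy of the sample PDF):

```typescript
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";

// Path is an assumption; adjust to wherever you saved the PDF.
const loader = new PDFLoader("../example_data/nke-10k-2023.pdf");
const docs = await loader.load();

console.log(docs.length); // one Document per page
```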
PDFLoader loads one Document object per PDF page. For each, we can easily access:
- The string content of the page;
- Metadata containing the file name and page number.
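For example:

```typescript
console.log(docs[0].pageContent.slice(0, 200)); // first 200 characters of page 1
console.log(docs[0].metadata); // includes the source file name and page number
```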
Splitting
For both information retrieval and downstream question-answering purposes, a page may be too coarse a representation. Our goal in the end will be to retrieve Document objects that answer an input query, and further splitting our PDF will help ensure that the meanings of relevant portions of the document are not “washed out” by surrounding text.
We can use text splitters for this purpose. Here we will use a simple text splitter that partitions based on characters. We will split our documents into chunks of 1000 characters with 200 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the RecursiveCharacterTextSplitter, which will recursively split the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.
We set add_start_index=True so that the character index where each split Document starts within the initial Document is preserved as the metadata attribute “start_index”.
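A minimal sketch of the chunk size and overlap configuration described above, assuming the RecursiveCharacterTextSplitter export from @langchain/textsplitters:

```typescript
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,   // target characters per chunk
  chunkOverlap: 200, // characters shared between adjacent chunks
});
const allSplits = await textSplitter.splitDocuments(docs);

console.log(allSplits.length);
```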
Embeddings
Vector search is a common way to store and search over unstructured data (such as unstructured text). The idea is to store numeric vectors that are associated with the text. Given a query, we can embed it as a vector of the same dimension and use vector similarity metrics (such as cosine similarity) to identify related text. LangChain supports embeddings from dozens of providers. These models specify how text should be converted into a numeric vector. Let’s select a model:
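For example, using OpenAI embeddings via the @langchain/openai package (the provider and model here are one choice among many):

```typescript
import { OpenAIEmbeddings } from "@langchain/openai";

const embeddings = new OpenAIEmbeddings({
  model: "text-embedding-3-large",
});

// Embedding a string yields a fixed-length numeric vector:
const vector = await embeddings.embedQuery(allSplits[0].pageContent);
console.log(vector.length);
```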
Vector stores
LangChain VectorStore objects contain methods for adding text and Document objects to the store, and querying them using various similarity metrics. They are often initialized with embedding models, which determine how text data is translated to numeric vectors.
LangChain includes a suite of integrations with different vector store technologies. Some vector stores are hosted by a provider (e.g., various cloud providers) and require specific credentials to use; some (such as Postgres) run in separate infrastructure, either locally or via a third party; others can run in-memory for lightweight workloads. Let’s select a vector store:
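A lightweight in-memory option, assuming the MemoryVectorStore export from the langchain package:

```typescript
import { MemoryVectorStore } from "langchain/vectorstores/memory";

// The store uses the embedding model above to vectorize added documents.
const vectorStore = new MemoryVectorStore(embeddings);
await vectorStore.addDocuments(allSplits);
```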
Having instantiated a VectorStore that contains documents, we can query it. VectorStore includes methods for querying:
- Synchronously and asynchronously;
- By string query and by vector;
- With and without returning similarity scores;
- By similarity and by maximum marginal relevance (to balance similarity to the query with diversity in retrieved results), as sketched below.
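For instance, the with-score and by-vector variants (method names taken from the JS VectorStore interface; the query string is illustrative):

```typescript
// Query with similarity scores returned alongside the documents:
const resultsWithScores = await vectorStore.similaritySearchWithScore(
  "When was Nike incorporated?"
);

// Query by vector: embed the query ourselves, then search with the raw vector.
const queryVector = await embeddings.embedQuery("When was Nike incorporated?");
const byVector = await vectorStore.similaritySearchVectorWithScore(queryVector, 4);
```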
Usage
Embeddings typically represent text as a “dense” vector such that texts with similar meanings are geometrically close. This lets us retrieve relevant information just by passing in a question, without knowledge of any specific key terms used in the document. Return documents based on similarity to a string query:
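For example:

```typescript
const results = await vectorStore.similaritySearch(
  "When was Nike incorporated?"
);
console.log(results[0]);
```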
Retrievers
LangChain VectorStore objects do not subclass Runnable. LangChain Retrievers are Runnables, so they implement a standard set of methods (e.g., synchronous and asynchronous invoke and batch operations). Although we can construct retrievers from vector stores, retrievers can also interface with non-vector-store sources of data, such as external APIs.
Vector stores implement an asRetriever method that will generate a Retriever, specifically a VectorStoreRetriever. These retrievers include specific searchType and searchKwargs attributes that identify which methods of the underlying vector store to call, and how to parameterize them. For instance, we can replicate the above with the following:
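A minimal sketch, assuming asRetriever accepts k and searchType options as in the current JS API:

```typescript
const retriever = vectorStore.asRetriever({
  searchType: "similarity",
  k: 1,
});

// Retrievers are Runnables, so batching over multiple queries works out of the box:
await retriever.batch([
  "When was Nike incorporated?",
  "What was Nike's revenue in 2023?",
]);
```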
VectorStoreRetriever supports search types of "similarity" (default), "mmr" (maximum marginal relevance, described above), and "similarity_score_threshold". We can use the latter to threshold documents output by the retriever by similarity score.
Retrievers can easily be incorporated into more complex applications, such as retrieval-augmented generation (RAG) applications that combine a given question with retrieved context into a prompt for an LLM. To learn more about building such an application, check out the RAG tutorial.