Pathway is an open data processing framework. It allows you to easily develop data transformation pipelines and Machine Learning applications that work with live data sources and changing data.
This notebook demonstrates how to use a live Pathway data indexing pipeline with Langchain. You can query the results of this pipeline from your chains in the same manner as you would a regular vector store. However, under the hood, Pathway updates the index on each data change, giving you always up-to-date answers.
In this notebook, we will use a public demo document processing pipeline and connect to its index using a VectorStore client, which implements the similarity_search function to retrieve matching documents.
The basic pipeline used in this document lets you build a simple vector index of files stored in a cloud location. However, Pathway provides everything needed to build realtime data pipelines and apps, including SQL-like table operations such as groupby-reductions and joins between disparate data sources, time-based grouping and windowing of data, and a wide array of connectors.
You'll need to install langchain-community with pip install -qU langchain-community to use this integration.
To configure the client, provide either the url or the host and port of your document indexing pipeline. In the code below we use a publicly available demo pipeline, whose REST API you can access at https://demo-document-indexing.pathway.stream. This demo ingests documents from Google Drive and SharePoint and maintains an index for retrieving documents.
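As a minimal sketch of the connection step, the snippet below builds a client for the demo pipeline and runs a similarity search against it. The helper names connect, top_matches, and main, the sample query, and the 200-character preview length are illustrative assumptions, not part of the integration. Calling main requires langchain-community to be installed and network access to the demo endpoint.

```python
DEMO_URL = "https://demo-document-indexing.pathway.stream"


def connect(url: str = DEMO_URL):
    """Create a client for a running Pathway document indexing pipeline."""
    # Imported lazily so the module still loads without langchain-community.
    from langchain_community.vectorstores import PathwayVectorClient

    # Alternatively, pass host=... and port=... instead of url=...
    return PathwayVectorClient(url=url)


def top_matches(client, query: str, k: int = 3) -> list[str]:
    """Return a short text preview of each of the k best-matching documents."""
    docs = client.similarity_search(query, k=k)
    # Truncate each document to 200 characters (an arbitrary preview length).
    return [doc.page_content[:200] for doc in docs]


def main() -> None:
    # Hypothetical query; requires network access to the public demo pipeline.
    client = connect()
    for preview in top_matches(client, "What is Pathway?"):
        print(preview)
```

Keeping the query logic in top_matches separate from connect means the same function works with any object that exposes similarity_search, which also makes it easy to try against a stub before pointing it at a live index.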
PathwayVectorClient.get_vectorstore_statistics() gives essential statistics on the state of the vector store, like the number of indexed files and the timestamp of the most recently updated one. You can use it in your chains to tell the user how fresh your knowledge base is.
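One way to surface these statistics to a user is sketched below. It assumes the call returns a dictionary with a file_count entry and a last_modified entry holding a Unix timestamp in seconds. Those key names are an assumption here and should be checked against the actual response of your pipeline. The helper freshness_note is hypothetical.

```python
from datetime import datetime, timezone


def freshness_note(stats: dict) -> str:
    """Turn vector store statistics into a one-line freshness message."""
    # Assumed keys: "file_count" and "last_modified" (Unix seconds).
    count = stats.get("file_count")
    ts = stats.get("last_modified")
    if ts is None:
        return f"{count} files indexed; last update time unknown."
    updated = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"{count} files indexed; last updated {updated:%Y-%m-%d %H:%M} UTC."
```

With a connected client, this could be used as freshness_note(client.get_vectorstore_statistics()) and the resulting string appended to a chain's answer.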
UTF-8 parser. You can find available parsers here.