The `unstructured` package from Unstructured.IO extracts clean text from raw source documents like PDFs and Word documents. This page covers how to use the `unstructured` ecosystem within LangChain.
The following steps get `unstructured` and its dependencies running.
To partition documents remotely instead of relying on the open-source `unstructured` package, install the Python SDK with `pip install unstructured-client` along with `pip install langchain-unstructured`. This gives you the `UnstructuredLoader`, which partitions remotely against the Unstructured API. This loader lives in a LangChain partner repo instead of the `langchain-community` repo, and you will need an `api_key`, which you can generate for free from Unstructured.
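As a rough illustration of the remote setup described above, the sketch below builds the arguments for `UnstructuredLoader` and loads one file through the API. The helper names `remote_loader_kwargs` and `load_remotely` are hypothetical, reading the key from an `UNSTRUCTURED_API_KEY` environment variable is an assumption, and the `file_path`, `api_key`, and `partition_via_api` parameters should be checked against the `langchain-unstructured` version you have installed.

```python
import os


def remote_loader_kwargs(file_path, api_key=None):
    """Assemble keyword arguments for remote partitioning (hypothetical helper)."""
    # Assumption: the key is supplied explicitly or via UNSTRUCTURED_API_KEY.
    key = api_key or os.environ.get("UNSTRUCTURED_API_KEY")
    if not key:
        raise ValueError("Set UNSTRUCTURED_API_KEY or pass api_key explicitly.")
    return {"file_path": file_path, "api_key": key, "partition_via_api": True}


def load_remotely(file_path, api_key=None):
    """Partition one file against the Unstructured API and return the documents."""
    # Imported lazily so the helper above works without the package installed.
    # Requires: pip install unstructured-client langchain-unstructured
    from langchain_unstructured import UnstructuredLoader

    loader = UnstructuredLoader(**remote_loader_kwargs(file_path, api_key))
    return loader.load()
```

Calling `load_remotely("report.pdf")` (a placeholder path) sends the file to the API, so it needs network access and a valid key.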
To run everything locally, install the open-source package with `pip install unstructured` along with `pip install langchain-community`, and use the same `UnstructuredLoader` as mentioned above.
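Document-specific dependencies can be installed with extras, for example `pip install "unstructured[docx]"`. Learn more about extras in the Unstructured installation documentation. To install the dependencies for all document types, use `pip install "unstructured[all-docs]"`.
Install the following system dependencies if they are not already available on your system, for example with `brew install` on Mac. Depending on what document types you're parsing, you may not need all of these.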
- `libmagic-dev` (filetype detection)
- `poppler-utils` (images and PDFs)
- `tesseract-ocr` (images and PDFs)
- `qpdf` (PDFs)
- `libreoffice` (MS Office docs)
- `pandoc` (EPUBs)

The primary use of `Unstructured` is in data loaders.
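Before running a loader locally, it can help to check which of the system dependencies listed above are present. The following is a minimal sketch using only the standard library. The mapping from package name to executable is an assumption (package and binary names vary by platform, and `libmagic-dev` is a library, so the `file` command is only a rough proxy), and `missing_tools` is a hypothetical helper.

```python
import shutil

# Assumed executable typically installed by each package; adjust per platform.
SYSTEM_TOOLS = {
    "libmagic-dev": "file",        # rough proxy: libmagic itself is a library
    "poppler-utils": "pdftotext",
    "tesseract-ocr": "tesseract",
    "qpdf": "qpdf",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}


def missing_tools(tools=SYSTEM_TOOLS, which=shutil.which):
    """Return the package names whose executable is not found on PATH."""
    return sorted(pkg for pkg, exe in tools.items() if which(exe) is None)


if __name__ == "__main__":
    for pkg in missing_tools():
        print(f"not found on PATH: {pkg}")
```

A missing entry is only a problem if you parse the matching document type, since not every dependency is needed for every format.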
CHM stands for Microsoft Compiled HTML Help.
A comma-separated values (CSV) file is a delimited text file that uses
a comma to separate values. Each line of the file is a data record.
Each record consists of one or more fields, separated by commas.
See a usage example.
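To make the record and field structure described above concrete, here is a small sketch that parses an in-memory CSV string with Python's standard `csv` module. It illustrates the file format only and is not the Unstructured loader; the sample data is invented.

```python
import csv
import io

# Invented sample: one header line followed by two data records.
sample = "name,format\nreport,pdf\nnotes,docx\n"

# Each line becomes one record; commas split the record into fields.
records = list(csv.DictReader(io.StringIO(sample)))

for record in records:
    print(record["name"], record["format"])
```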
EPUB is an e-book file format that uses the “.epub” file extension. The term is short for electronic publication and is sometimes styled ePub. EPUB is supported by many e-readers, and compatible
software is available for most smartphones, tablets, and computers.
See a usage example.
The Open Document Format for Office Applications (ODF), also known as OpenDocument, is an open file format for word processing documents, spreadsheets, presentations, and graphics that uses ZIP-compressed XML files. It was developed with the aim of
providing an open, XML-based file format specification for office applications.
See a usage example.
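Because ODF files are ZIP archives of XML, their contents can be inspected with the standard library alone. The sketch below builds a deliberately minimal in-memory sample (it omits the manifest and styles a real document would carry) and reads its paragraphs back. `make_sample_odt` and `read_paragraphs` are hypothetical helpers, and this is not how the Unstructured loader is implemented.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

CONTENT_XML = (
    f'<office:document-content xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}">'
    "<office:body><office:text>"
    "<text:p>Hello from ODF</text:p>"
    "</office:text></office:body>"
    "</office:document-content>"
)


def make_sample_odt() -> bytes:
    """Build a tiny, incomplete ODF-like archive for illustration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Passing a ZipInfo keeps the mimetype entry stored uncompressed.
        archive.writestr(
            zipfile.ZipInfo("mimetype"), "application/vnd.oasis.opendocument.text"
        )
        archive.writestr("content.xml", CONTENT_XML)
    return buffer.getvalue()


def read_paragraphs(odt_bytes: bytes) -> list[str]:
    """Return the text of each text:p element found in content.xml."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    return ["".join(p.itertext()) for p in root.iter(f"{{{TEXT_NS}}}p")]


print(read_paragraphs(make_sample_odt()))
```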
A reStructuredText (RST) file is a file format for textual data
used primarily in the Python programming language community for technical documentation.
See a usage example.
A tab-separated values (TSV) file is a simple, text-based file format for storing tabular data.
Records are separated by newlines, and values within a record are separated by tab characters.
See a usage example.
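The newline and tab structure described above can be read with Python's standard `csv` module by setting the delimiter to a tab. This is a small sketch of the format with invented sample data, not the Unstructured loader itself.

```python
import csv
import io

# Invented sample: newlines separate records, tabs separate values.
sample = "name\tformat\nreport\tpdf\nnotes\tdocx\n"

rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))
header, records = rows[0], rows[1:]

for record in records:
    print(dict(zip(header, record)))
```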