Microsoft PowerPoint is a presentation program by Microsoft. This covers how to load Microsoft PowerPoint documents into a document format that we can use downstream.
Please see this guide for more instructions on setting up Unstructured locally, including setting up required system dependencies.
Unstructured creates different "elements" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying mode="elements".
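As a minimal sketch of the two modes described above, the helper below wraps UnstructuredPowerPointLoader from langchain_community. The function names load_pptx and unstructured_mode are hypothetical names introduced for illustration, and the sketch assumes that langchain-community and unstructured (with its PowerPoint dependencies) are installed.

```python
from typing import Any, List


def unstructured_mode(keep_elements: bool) -> str:
    """Map a boolean choice onto the loader's `mode` argument."""
    # "single" combines all elements into one document (the default);
    # "elements" keeps each element as its own document.
    return "elements" if keep_elements else "single"


def load_pptx(path: str, keep_elements: bool = False) -> List[Any]:
    """Load a PowerPoint file into LangChain documents (hypothetical helper)."""
    # Imported lazily so the module can be imported even when the optional
    # langchain-community / unstructured packages are not installed.
    from langchain_community.document_loaders import UnstructuredPowerPointLoader

    loader = UnstructuredPowerPointLoader(path, mode=unstructured_mode(keep_elements))
    return loader.load()
```

Calling load_pptx("example.pptx", keep_elements=True), where example.pptx is a placeholder path, should return one document per element rather than a single combined document.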
Azure AI Document Intelligence (formerly known as Azure Form Recognizer) is a machine-learning based service that extracts text (including handwriting), tables, document structures (e.g., titles and section headings) and key-value pairs from digital or scanned PDFs, images, Office and HTML files. Document Intelligence supports JPEG/JPG, PNG, BMP, TIFF, HEIF, DOCX, XLSX, PPTX and HTML.
This current implementation of a loader using Document Intelligence can incorporate content page-wise and turn it into LangChain documents. The default output format is markdown, which can be easily chained with MarkdownHeaderTextSplitter for semantic document chunking. You can also use mode="single" or mode="page" to return plain text as a single document or as one document per page.
Pass the <endpoint> and <key> of your Document Intelligence resource as parameters to the loader.
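The following sketch shows one way the endpoint and key might be passed to AzureAIDocumentIntelligenceLoader, and how markdown output could then be chunked with MarkdownHeaderTextSplitter. The environment variable names and the helper functions are assumptions made for this example, not part of the loader's API. The sketch also assumes that langchain-community, langchain-text-splitters and azure-ai-documentintelligence are installed.

```python
import os
from typing import Any, List, Mapping, Tuple

# Hypothetical environment variable names, chosen for this sketch only.
ENDPOINT_VAR = "AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT"
KEY_VAR = "AZURE_DOCUMENT_INTELLIGENCE_KEY"


def read_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (endpoint, key), failing early if either setting is missing."""
    missing = [name for name in (ENDPOINT_VAR, KEY_VAR) if not env.get(name)]
    if missing:
        raise ValueError("missing settings: " + ", ".join(missing))
    return env[ENDPOINT_VAR], env[KEY_VAR]


def load_with_document_intelligence(path: str, mode: str = "markdown") -> List[Any]:
    """Load a local file through Document Intelligence (hypothetical helper)."""
    # Lazy import: the Azure and LangChain packages are optional dependencies.
    from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader

    endpoint, key = read_credentials()
    loader = AzureAIDocumentIntelligenceLoader(
        api_endpoint=endpoint,
        api_key=key,
        file_path=path,
        api_model="prebuilt-layout",
        mode=mode,  # "markdown" (default), "single" or "page"
    )
    return loader.load()


def split_markdown_by_headers(markdown_text: str) -> List[Any]:
    """Chunk markdown output on its first- and second-level headings."""
    from langchain_text_splitters import MarkdownHeaderTextSplitter

    splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")]
    )
    return splitter.split_text(markdown_text)
```

Reading the credentials from the environment is only one design choice; the endpoint and key can equally be passed as literal strings. With the default markdown mode, the page_content of each loaded document can be handed to split_markdown_by_headers for header-based chunking.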