Document loaders - Docs by LangChain

Document loaders provide a standard interface for reading data from different sources (such as Slack, Notion, or Google Drive) into LangChain’s Document format. This ensures that data can be handled consistently regardless of the source. All document loaders implement the BaseLoader interface.

Interface

Each document loader may define its own parameters, but they share a common API:

`load()` – Loads all documents at once.
`lazy_load()` – Streams documents lazily, useful for large datasets.

from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(
    ...  # Integration-specific parameters here
)

# Load all documents
documents = loader.load()

# For large datasets, lazily load documents
for document in loader.lazy_load():
    print(document)
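To make the relationship between the two methods concrete, here is a minimal standard-library sketch. It does not use LangChain itself: `Document` is a stand-in dataclass and `RowLoader` is a hypothetical loader invented for illustration. It assumes the common pattern in which `load()` simply collects whatever `lazy_load()` yields.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class Document:
    """Stand-in for LangChain's Document: text plus metadata (assumption)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class RowLoader:
    """Hypothetical loader that yields one Document per CSV row."""

    def __init__(self, csv_text: str, source: str = "inline") -> None:
        self.csv_text = csv_text
        self.source = source

    def lazy_load(self) -> Iterator[Document]:
        # Yield rows one at a time so the caller never holds the full dataset.
        reader = csv.DictReader(io.StringIO(self.csv_text))
        for row_number, row in enumerate(reader):
            content = "\n".join(f"{key}: {value}" for key, value in row.items())
            yield Document(content, {"source": self.source, "row": row_number})

    def load(self) -> List[Document]:
        # Eager variant: materialize everything the lazy iterator produces.
        return list(self.lazy_load())


loader = RowLoader("name,team\nAda,core\nLin,docs\n")
documents = loader.load()          # all rows at once
first = next(loader.lazy_load())   # only the first row is parsed
```

Because `load()` is built on `lazy_load()`, a loader written this way only needs to implement the generator. Callers can then choose between convenience and bounded memory use.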

By category

Webpages

The document loaders below load webpages.

Document Loader	Description	Package/API
Web	Uses urllib and BeautifulSoup to load and parse HTML web pages	Package
Unstructured	Uses Unstructured to load and parse web pages	Package
RecursiveURL	Recursively scrapes all child links from a root URL	Package
Sitemap	Scrapes all pages on a given sitemap	Package
Spider	Crawler and scraper that returns LLM-ready data	API
Firecrawl	API service that can be deployed locally	API
Docling	Uses Docling to load and parse web pages	Package
Hyperbrowser	Platform for running and scaling headless browsers, can be used to scrape/crawl any site	API
AgentQL	Web interaction and structured data extraction from any web page using an AgentQL query or a Natural Language prompt	API

PDFs

The document loaders below load PDF documents.

Document Loader	Description	Package/API
PyPDF	Uses `pypdf` to load and parse PDFs	Package
Unstructured	Uses Unstructured’s open source library to load PDFs	Package
Amazon Textract	Uses AWS API to load PDFs	API
MathPix	Uses MathPix to load PDFs	Package
PDFPlumber	Load PDF files using PDFPlumber	Package
PyPDFDirectory	Load a directory with PDF files	Package
PyPDFium2	Load PDF files using PyPDFium2	Package
PyMuPDF	Load PDF files using PyMuPDF	Package
PyMuPDF4LLM	Load PDF content to Markdown using PyMuPDF4LLM	Package
PDFMiner	Load PDF files using PDFMiner	Package
Upstage Document Parse Loader	Load PDF files using UpstageDocumentParseLoader	Package
Docling	Load PDF files using Docling	Package
UnDatasIO	Load PDF files using UnDatasIO	Package
OpenDataLoader PDF	Load PDF files using OpenDataLoader PDF	Package

Cloud providers

The document loaders below load documents from cloud providers.

Document Loader	Description	Partner Package	API reference
AWS S3 Directory	Load documents from an AWS S3 directory	❌	`S3DirectoryLoader`
AWS S3 File	Load documents from an AWS S3 file	❌	`S3FileLoader`
Azure AI Data	Load documents from Azure AI services	❌	`AzureAIDataLoader`
Azure Blob Storage	Load documents from Azure Blob Storage	✅	`AzureBlobStorageLoader`
Dropbox	Load documents from Dropbox	❌	`DropboxLoader`
Google Cloud Storage Directory	Load documents from GCS bucket	✅	`GCSDirectoryLoader`
Google Cloud Storage File	Load documents from GCS file object	✅	`GCSFileLoader`
Google Drive	Load documents from Google Drive (Google Docs only)	✅	`GoogleDriveLoader`
Huawei OBS Directory	Load documents from Huawei Object Storage Service Directory	❌	`OBSDirectoryLoader`
Huawei OBS File	Load documents from Huawei Object Storage Service File	❌	`OBSFileLoader`
Microsoft OneDrive	Load documents from Microsoft OneDrive	❌	`OneDriveLoader`
Microsoft SharePoint	Load documents from Microsoft SharePoint	❌	`SharePointLoader`
Tencent COS Directory	Load documents from Tencent Cloud Object Storage Directory	❌	`TencentCOSDirectoryLoader`
Tencent COS File	Load documents from Tencent Cloud Object Storage File	❌	`TencentCOSFileLoader`

Social platforms

The document loaders below load documents from social media platforms.

Document Loader	API reference
Twitter	`TwitterTweetLoader`
Reddit	`RedditPostsLoader`

Messaging services

The document loaders below load data from messaging platforms.

Document Loader	API reference
Telegram	`TelegramChatFileLoader`
WhatsApp	`WhatsAppChatLoader`
Discord	`DiscordChatLoader`
Facebook Chat	`FacebookChatLoader`
Mastodon	`MastodonTootsLoader`

Productivity tools

The document loaders below load data from commonly used productivity tools.

Document Loader	API reference
Figma	`FigmaFileLoader`
Notion	`NotionDirectoryLoader`
Slack	`SlackDirectoryLoader`
Quip	`QuipLoader`
Trello	`TrelloLoader`
Roam	`RoamLoader`
GitHub	`GithubFileLoader`

Common file types

The document loaders below load data from common file formats.

Document Loader	Data Type
`CSVLoader`	CSV files
`Unstructured`	Many file types (see https://docs.unstructured.io/platform/supported-file-types)
`JSONLoader`	JSON files
`BSHTMLLoader`	HTML files
`DoclingLoader`	Various file types (see https://ds4sd.github.io/docling/)
`PolarisAIDataInsightLoader`	Various file types (see https://datainsight.polarisoffice.com/documentation?docType=doc_extract)

All document loaders

acreom

AgentQLLoader

AirbyteLoader

Airtable

Alibaba Cloud MaxCompute

Amazon Textract

Apify Dataset

ArxivLoader

AssemblyAI Audio Transcripts

AstraDB

Async Chromium

AsyncHtml

Athena

AWS S3 Directory

AWS S3 File

AZLyrics

Azure AI Data

Azure Blob Storage

Azure AI Document Intelligence

BibTeX

BiliBili

Blackboard

Blockchain

Box

Brave Search

Browserbase

Browserless

BSHTMLLoader

Cassandra

ChatGPT Data

College Confidential

Concurrent Loader

Confluence

CoNLL-U

Copy Paste

Couchbase

CSV

Cube Semantic Layer

Datadog Logs

Dedoc

Diffbot

Discord

Docling

Docugami

Docusaurus

Dropbox

Email

EPub

Etherscan

EverNote

Facebook Chat

Fauna

Figma

FireCrawl

Geopandas

Git

GitBook

GitHub

Glue Catalog

Google AlloyDB for PostgreSQL

Google BigQuery

Google Bigtable

Google Cloud SQL for SQL Server

Google Cloud SQL for MySQL

Google Cloud SQL for PostgreSQL

Google Cloud Storage Directory

Google Cloud Storage File

Google Firestore in Datastore Mode

Google Drive

Google El Carro for Oracle Workloads

Google Firestore (Native Mode)

Google Memorystore for Redis

Google Spanner

Google Speech-to-Text

Grobid

Gutenberg

Hacker News

Huawei OBS Directory

Huawei OBS File

HuggingFace Dataset

HyperbrowserLoader

iFixit

Images

Image Captions

IMSDb

Iugu

Joplin

JSONLoader

Jupyter Notebook

Kinetica

lakeFS

LangSmith

LarkSuite (FeiShu)

LLM Sherpa

Mastodon

MathPixPDFLoader

MediaWiki Dump

Merge Documents Loader

MHTML

Microsoft Excel

Microsoft OneDrive

Microsoft OneNote

Microsoft PowerPoint

Microsoft SharePoint

Microsoft Word

Near Blockchain

Modern Treasury

MongoDB

Needle Document Loader

News URL

Notion DB

Nuclia

Obsidian

OpenDataLoader PDF

Open Document Format (ODT)

Open City Data

Oracle Autonomous Database

Oracle AI Vector Search

Org-mode

Outline Document Loader

Pandas DataFrame

PDFMinerLoader

PDFPlumber

Pebblo Safe DocumentLoader

Polaris AI DataInsight

Polars DataFrame

Dell PowerScale

Psychic

PubMed

PullMdLoader

PyMuPDFLoader

PyMuPDF4LLM

PyPDFDirectoryLoader

PyPDFium2Loader

PyPDFLoader

PySpark

Quip

ReadTheDocs Documentation

Recursive URL

Roam

Rockset

rspace

RSS Feeds

RST

scrapfly

ScrapingAnt

SingleStore

Sitemap

Slack

Snowflake

Source Code

Spider

Spreedly

Stripe

Subtitle

SurrealDB

Tencent COS Directory

Tencent COS File

TensorFlow Datasets

TiDB

2Markdown

TOML

Trello

TSV

Twitter

UnDatasIO

Unstructured

UnstructuredMarkdownLoader

UnstructuredPDFLoader

Upstage

URL

Vsdx

Weather

WebBaseLoader

WhatsApp Chat

Wikipedia

UnstructuredXMLLoader

Xorbits Pandas DataFrame

YouTube Audio

YouTube Transcripts

YoutubeLoaderDL

Yuque

ZeroxPDFLoader
