Apify Dataset is a scalable, append-only storage with sequential access, built for storing structured web scraping results (such as a list of products or Google SERPs) and exporting them to various formats like JSON, CSV, or Excel. Datasets are mainly used to save results of Apify Actors, serverless cloud programs for web scraping, crawling, and data extraction use cases.

This notebook shows how to load Apify datasets into LangChain.
## Integration details
| Class | Package | Serializable | JS support |
|---|---|---|---|
| ApifyDatasetLoader | langchain-apify | ❌ | ✅ |
## Loader features
| Source | Document Lazy Loading | Native Async Support |
|---|---|---|
| Apify Dataset | ❌ | ❌ |
## Prerequisites
You need an existing dataset on the Apify platform. This example shows how to load a dataset produced by the Website Content Crawler. To use it, import ApifyDatasetLoader into your source code.
## Pricing

Apify Actors can be priced in different ways, depending on the Actor you run. Many Actors support Pay-Per-Event (PPE) pricing, where you pay for explicit events defined by the Actor author (for example, per dataset item). This can be a good fit for agent workloads where you want clear, per-operation costs.

## Map dataset items to documents
Next, define a function that maps Apify dataset record fields to the LangChain Document format.
For example, if your dataset items contain a page's text and its source URL, a mapping function can convert each record to the LangChain Document format, so that you can use the results with any LLM model (e.g. for question answering).
## An example with question answering
In this example, we use data from a dataset to answer a question.

## Using the Apify MCP server
Unsure which Actor to use or what parameters it requires? The Apify MCP (Model Context Protocol) server can help you discover available Actors, explore their input schemas, and understand parameter requirements. When connecting to the Apify MCP server over HTTP, include your Apify token in the request headers.
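For instance, assuming the token is sent as a standard Bearer header (reading it from an `APIFY_TOKEN` environment variable here is an illustrative choice):

```python
import os

# Placeholder fallback; set APIFY_TOKEN in your environment for real use.
token = os.environ.get("APIFY_TOKEN", "<APIFY_TOKEN>")

# Request headers for connecting to the Apify MCP server over HTTP.
headers = {"Authorization": f"Bearer {token}"}
```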