Cheerio integration

This notebook provides a quick overview for getting started with CheerioWebBaseLoader. For detailed documentation of all CheerioWebBaseLoader features and configurations, head to the API reference.

Overview

Integration details

This example shows how to load data from webpages using Cheerio. One document is created for each webpage.

Cheerio is a fast, lightweight library for parsing and traversing HTML documents with a jQuery-like syntax. You can use it to extract data from web pages without rendering them in a browser. However, Cheerio does not simulate a web browser and cannot execute JavaScript on the page, so it cannot extract content from dynamic pages that require JavaScript to render. For those pages, use PlaywrightWebBaseLoader or PuppeteerWebBaseLoader instead.
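Because Cheerio only parses the HTML that the server returns, a JavaScript-rendered page usually yields very little text. One rough way to notice this after loading is to check how much non-whitespace text came back. The sketch below is a heuristic under stated assumptions: the function name `looksClientRendered` and the 200-character default threshold are hypothetical and are not part of any LangChain or Cheerio API.

```typescript
// Hypothetical heuristic: treat a page as likely client-rendered when the
// extracted text has fewer than `minChars` non-whitespace characters.
// The default threshold is an arbitrary assumption; tune it per site.
function looksClientRendered(pageContent: string, minChars = 200): boolean {
  const visibleChars = pageContent.replace(/\s+/g, "").length;
  return visibleChars < minChars;
}

// A page with real text passes the check; a near-empty app shell does not.
const staticPage = "Hacker News ".repeat(50);
const appShell = "\n   \n  Loading...\n";
console.log(looksClientRendered(staticPage)); // false
console.log(looksClientRendered(appShell)); // true
```

If the check returns true for a loaded document's pageContent, that is a hint (not proof) that a browser-based loader may be a better fit for the page.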

Class	Package	Local	Serializable	PY support
CheerioWebBaseLoader	@langchain/community	✅	✅	❌

Loader features

Source	Web Support	Node Support
CheerioWebBaseLoader	✅	✅

Setup

To access the CheerioWebBaseLoader document loader, you’ll need to install the @langchain/community integration package, along with the cheerio peer dependency.

Credentials

If you want automated tracing of your model calls, you can also set your LangSmith API key by uncommenting the lines below:

# export LANGSMITH_TRACING="true"
# export LANGSMITH_API_KEY="your-api-key"

Installation

The LangChain CheerioWebBaseLoader integration lives in the @langchain/community package:

npm install @langchain/community @langchain/core cheerio

Instantiation

Now we can instantiate our loader object and load documents:

import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio"

const loader = new CheerioWebBaseLoader("https://news.ycombinator.com/item?id=34817881", {
  // optional params: ...
})

Load

const docs = await loader.load()
docs[0]

Document {
  pageContent: '\n' +
    '        \n' +
    '                  Hacker News\n' +
    '                            new | past | comments | ask | show | jobs | submit            \n' +
    '                              login\n' +
    '                          \n' +
    '              \n' +
    '\n' +
    '        \n' +
    '            What Lights the Universe’s Standard Candles? (quantamagazine.org)\n' +
    '          75 points by Amorymeltzer on Feb 17, 2023  | hide | past | favorite | 6 comments        \n' +
    '              \n' +
    '        \n' +
    '                  \n' +
    '          \n' +
    '          delta_p_delta_x on Feb 17, 2023           \n' +
    '             | next [–]          \n' +
    '                  \n' +
    "                  Astrophysical and cosmological simulations are often insightful. They're also very cross-disciplinary; besides the obvious astrophysics, there's networking and sysadmin, parallel computing and algorithm theory (so that the simulation programs are actually fast but still accurate), systems design, and even a bit of graphic design for the visualisations.Some of my favourite simulation projects:- IllustrisTNG: https://www.tng-project.org/- SWIFT: https://swift.dur.ac.uk/- CO5BOLD: https://www.astro.uu.se/~bf/co5bold_main.html (which produced these animations of a red-giant star: https://www.astro.uu.se/~bf/movie/AGBmovie.html)- AbacusSummit: https://abacussummit.readthedocs.io/en/latest/And I can add the simulations in the article, too.\n" +
    '                      \n' +
    '                  \n' +
    '      \n' +
    '        \n' +
    '                      \n' +
    '          \n' +
    '          froeb on Feb 18, 2023           \n' +
    '             | parent | next [–]          \n' +
    '                  \n' +
    "                  Supernova simulations are especially interesting too. I have heard them described as the only time in physics when all 4 of the fundamental forces are important. The explosion can be quite finicky too. If I remember right, you can't get supernova to explode properly in 1D simulations, only in higher dimensions. This was a mystery until the realization that turbulence is necessary for supernova to trigger--there is no turbulent flow in 1D.\n" +
    '                      \n' +
    '                  \n' +
    '      \n' +
    '        \n' +
    '                        \n' +
    '          \n' +
    '          andrewflnr on Feb 17, 2023           \n' +
    '             | prev | next [–]          \n' +
    '                  \n' +
    "                  Whoa. I didn't know the accretion theory of Ia supernovae was dead, much less that it had been since 2011.\n" +
    '                      \n' +
    '                  \n' +
    '      \n' +
    '        \n' +
    '                  \n' +
    '          \n' +
    '          andreareina on Feb 17, 2023           \n' +
    '             | prev | next [–]          \n' +
    '                  \n' +
    '                  This seems  to be the paper https://academic.oup.com/mnras/article/517/4/5260/6779709\n' +
    '                      \n' +
    '                  \n' +
    '      \n' +
    '        \n' +
    '                  \n' +
    '          \n' +
    '          andreareina on Feb 17, 2023           \n' +
    '             | prev [–]          \n' +
    '                  \n' +
    "                  Wouldn't double detonation show up as variance in the brightness?\n" +
    '                      \n' +
    '                  \n' +
    '      \n' +
    '        \n' +
    '                      \n' +
    '          \n' +
    '          yencabulator on Feb 18, 2023           \n' +
    '             | parent [–]          \n' +
    '                  \n' +
    '                  Or widening of the peak. If one type Ia supernova goes 1,2,3,2,1, the sum of two could go    1+0=1\n' +
    '    2+1=3\n' +
    '    3+2=5\n' +
    '    2+3=5\n' +
    '    1+2=3\n' +
    '    0+1=1\n' +
    '                      \n' +
    '                  \n' +
    '      \n' +
    '        \n' +
    '                  \n' +
    '  \n' +
    '\n' +
    '\n' +
    'Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact\n' +
    'Search:       \n' +
    '      \n' +
    '  \n',
  metadata: { source: 'https://news.ycombinator.com/item?id=34817881' },
  id: undefined
}

console.log(docs[0].metadata)

{ source: 'https://news.ycombinator.com/item?id=34817881' }
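The pageContent shown above carries a lot of indentation and blank lines over from the page's HTML. If that is unwanted downstream, one option is to collapse it after loading. The following is a minimal sketch, not part of the loader's API: `LoadedDoc` is a local stand-in for the document shape shown above, and `collapseWhitespace` is a hypothetical helper name.

```typescript
// Local stand-in for the shape of a loaded document (assumption: only the
// two fields shown in the output above are needed here).
interface LoadedDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical helper: collapses runs of spaces/tabs inside each line to a
// single space, trims each line, and drops lines that end up empty.
function collapseWhitespace(doc: LoadedDoc): LoadedDoc {
  const pageContent = doc.pageContent
    .split("\n")
    .map((line) => line.replace(/[ \t]+/g, " ").trim())
    .filter((line) => line.length > 0)
    .join("\n");
  return { ...doc, pageContent };
}

// Usage with a small sample shaped like the output above.
const sample: LoadedDoc = {
  pageContent: "\n        \n          Hacker News\n      new | past   \n",
  metadata: { source: "https://news.ycombinator.com/item?id=34817881" },
};
const cleaned = collapseWhitespace(sample);
console.log(cleaned.pageContent);
```

The helper returns a new object and leaves metadata untouched, so the source URL stays attached to the cleaned text.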

Additional configurations

CheerioWebBaseLoader supports additional configuration when instantiating the loader. Here is an example that passes the selector field, so that the loader only loads content from elements matching the provided CSS selector:

import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio"

const loaderWithSelector = new CheerioWebBaseLoader("https://news.ycombinator.com/item?id=34817881", {
  selector: "p",
});

const docsWithSelector = await loaderWithSelector.load();
docsWithSelector[0].pageContent;

Some of my favourite simulation projects:- IllustrisTNG: https://www.tng-project.org/- SWIFT: https://swift.dur.ac.uk/- CO5BOLD: https://www.astro.uu.se/~bf/co5bold_main.html (which produced these animations of a red-giant star: https://www.astro.uu.se/~bf/movie/AGBmovie.html)- AbacusSummit: https://abacussummit.readthedocs.io/en/latest/And I can add the simulations in the article, too.

API reference

For detailed documentation of all CheerioWebBaseLoader features and configurations, head to the API reference.
