# RecursiveUrlLoader

The `RecursiveUrlLoader` lets you recursively scrape all child links from a root URL and parse them into Documents.
| Class | Package | Local | Serializable | JS support |
| --- | --- | --- | --- | --- |
| RecursiveUrlLoader | langchain-community | ✅ | ❌ | ✅ |

| Source | Document Lazy Loading | Native Async Support |
| --- | --- | --- |
| RecursiveUrlLoader | ✅ | ❌ |
## Setup

The `RecursiveUrlLoader` lives in the `langchain-community` package. There are no other required packages, though you will get richer default Document metadata if you have `beautifulsoup4` installed as well.
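Assuming a standard pip-based environment, installation might look like:

```shell
pip install -qU langchain-community beautifulsoup4
```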
Call `.load()` to synchronously load all Documents into memory, with one `Document` per visited URL. Starting from the initial URL, the loader recurses through all linked URLs up to the specified `max_depth`.
Let's run through a basic example of how to use the `RecursiveUrlLoader` on the Python 3.9 documentation.
To control how each page's content is parsed into a Document's `page_content`, you can pass a custom `extractor` method. Similarly, you can pass a `metadata_extractor` to customize how Document metadata is extracted from the HTTP response. See the API reference for more on this.
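For example, assuming `beautifulsoup4` is installed, a simple extractor that strips each page's HTML down to plain text might look like this (the function name is illustrative):

```python
import re

from bs4 import BeautifulSoup


def bs4_extractor(html: str) -> str:
    """Parse raw HTML and return its text, collapsing runs of blank lines."""
    soup = BeautifulSoup(html, "html.parser")
    return re.sub(r"\n\n+", "\n\n", soup.text).strip()


# Passed to the loader as: RecursiveUrlLoader(url, extractor=bs4_extractor)
print(bs4_extractor("<html><body>\n\n\n<p>Hello, world!</p>\n</body></html>"))
# → Hello, world!
```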
These examples cover the basics of the `RecursiveUrlLoader`, but there are many more modifications that can be made to best fit your use case. The parameters `link_regex` and `exclude_dirs` can help you filter out unwanted URLs, `aload()` and `alazy_load()` can be used for asynchronous loading, and more.
For detailed information on configuring and calling the `RecursiveUrlLoader`, please see the [API reference](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader.html).