Extending `WebBaseLoader`, `SitemapLoader` loads a sitemap from a given URL, then scrapes and loads all pages in the sitemap, returning each page as a `Document`.
The scraping is done concurrently. There are reasonable limits on concurrent requests, defaulting to 2 per second. If you aren't concerned about being a good citizen, control the server being scraped, or don't care about load, you can increase this limit. Note that while this will speed up the scraping process, it may cause the server to block you. Be careful!
| Class | Package | Local | Serializable | JS support |
| :--- | :--- | :---: | :---: | :---: |
| SitemapLoader | langchain-community | ✅ | ❌ | ✅ |
| Source | Document Lazy Loading | Native Async Support |
| :--- | :---: | :---: |
| SitemapLoader | ✅ | ❌ |
To access `SitemapLoader`, you'll need to install the `langchain-community` integration package (`pip install -U langchain-community`).
You can use the `requests_per_second` parameter to increase the maximum number of concurrent requests, and the `requests_kwargs` parameter to pass keyword arguments to the underlying HTTP requests.
You can filter which sitemap entries are scraped by passing a list of strings or regex patterns to the `filter_urls` parameter; only URLs that match one of the patterns will be loaded.
`SitemapLoader` uses `beautifulsoup4` for the scraping process, and it scrapes every element on the page by default. The `SitemapLoader` constructor accepts a custom scraping function. This feature can be helpful to tailor the scraping process to your specific needs; for example, you might want to avoid scraping headers or navigation elements.
The following example shows how to develop and use a custom function to avoid navigation and header elements. Import the `beautifulsoup4` library and define the custom function.
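A sketch of such a function, which drops `nav` and `header` elements before extracting the page text (any other element names could be filtered the same way):

```python
from bs4 import BeautifulSoup


def remove_nav_and_header_elements(content: BeautifulSoup) -> str:
    # Find all 'nav' and 'header' elements in the parsed page.
    nav_elements = content.find_all("nav")
    header_elements = content.find_all("header")

    # Remove each matched element from the tree.
    for element in nav_elements + header_elements:
        element.decompose()

    # Return the text of whatever remains.
    return str(content.get_text())
```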
Add your custom function to the `SitemapLoader` object.