Wikipedia is a multilingual free online encyclopedia written and maintained by a community of volunteers, known as Wikipedians, through open collaboration and using a wiki-based editing system called MediaWiki. Wikipedia
is the largest and most-read reference work in history.
This notebook shows how to retrieve wiki pages from wikipedia.org into the `Document` format that is used downstream.
The integration lives in the `langchain-community` package. We also need to install the `wikipedia` Python package itself.
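Assuming a standard pip-based environment, both packages can be installed in one step (the `-q` and `-U` flags are optional):

```shell
pip install -qU langchain-community wikipedia
```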
`WikipediaRetriever` parameters include:

- `lang`: default="en". Use it to search in a specific language part of Wikipedia.
- `load_max_docs`: default=100. Use it to limit the number of downloaded documents. It takes time to download all 100 documents, so use a small number for experiments. There is a hard limit of 300 for now.
- `load_all_available_meta`: default=False. By default only the most important fields are downloaded: `Published` (date when the document was published/last updated), `title`, `Summary`. If True, other fields are also downloaded.

`get_relevant_documents()` has one argument, `query`: free text used to find documents in Wikipedia.
`WikipediaRetriever` can be incorporated into LLM applications via chains.
We will need an LLM or chat model:
For detailed documentation of all `WikipediaRetriever` features and configurations, head to the API reference.