ArxivRetriever integration

The ArxivRetriever lets you query arXiv for academic articles. It supports both full-document retrieval (PDF parsing) and summary-based retrieval. For detailed documentation of all ArxivRetriever features and configurations, head to the API reference.

Features

Query Flexibility: Search using natural language queries or specific arXiv IDs.
Full-Document Retrieval: Option to fetch and parse PDFs.
Summaries as Documents: Retrieve summaries for faster results.
Customizable Options: Configure maximum results and output format.

Integration details

Retriever	Source	Package
`ArxivRetriever`	Academic articles from arXiv	`@langchain/community`

Setup

Ensure the following dependencies are installed:

pdf-parse for parsing PDFs
fast-xml-parser for parsing XML responses from the arXiv API

npm install pdf-parse fast-xml-parser
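
For context on why an XML parser is needed: the public arXiv API answers queries with an Atom XML feed. The sketch below only builds a query URL of the kind that API accepts, using its documented `search_query` and `max_results` parameters. The helper name `buildArxivQueryUrl` and the `all:` field prefix are illustrative assumptions; this is not necessarily how ArxivRetriever constructs its requests internally.

```typescript
// Public arXiv API endpoint; responses are Atom XML feeds.
const ARXIV_API_ENDPOINT = "http://export.arxiv.org/api/query";

// Hypothetical helper (not part of @langchain/community): builds a search URL
// that matches the given terms against all fields.
function buildArxivQueryUrl(searchTerms: string, maxResults: number): string {
  const params = [
    `search_query=${encodeURIComponent(`all:${searchTerms}`)}`,
    `max_results=${maxResults}`,
  ];
  return `${ARXIV_API_ENDPOINT}?${params.join("&")}`;
}

console.log(buildArxivQueryUrl("quantum computing", 5));
// http://export.arxiv.org/api/query?search_query=all%3Aquantum%20computing&max_results=5
```

Fetching that URL returns XML rather than JSON, which is the part fast-xml-parser handles.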

Instantiation

import { ArxivRetriever } from "@langchain/community/retrievers/arxiv";

const retriever = new ArxivRetriever({
  getFullDocuments: false, // Set to true to fetch full documents (PDFs)
  maxSearchResults: 5, // Maximum number of results to retrieve
});

Usage

Use the invoke method to search arXiv for relevant articles. You can use either natural language queries or specific arXiv IDs.

const query = "quantum computing";

const documents = await retriever.invoke(query);
documents.forEach((doc) => {
  console.log("Title:", doc.metadata.title);
  console.log("Content:", doc.pageContent); // Summary text, or parsed PDF content when getFullDocuments is true
});
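
Since a query can be either free text or an arXiv ID, it can help to know which kind of string you are about to pass in, for example when logging or validating user input. The following is a minimal sketch, assuming the two common identifier shapes: new-style IDs such as `2301.12345` and old-style IDs such as `hep-th/9901001`, each with an optional version suffix. `looksLikeArxivId` is a hypothetical helper and is not part of the ArxivRetriever API; the retriever's own detection logic may differ.

```typescript
// Hypothetical helper: not part of ArxivRetriever.
// New-style IDs: four digits, a dot, four or five digits, optional "vN".
const NEW_STYLE_ID = /^\d{4}\.\d{4,5}(v\d+)?$/;
// Old-style IDs: archive name, optional subject class, slash, seven digits, optional "vN".
const OLD_STYLE_ID = /^[a-z-]+(\.[A-Z]{2})?\/\d{7}(v\d+)?$/;

function looksLikeArxivId(query: string): boolean {
  const trimmed = query.trim();
  return NEW_STYLE_ID.test(trimmed) || OLD_STYLE_ID.test(trimmed);
}

console.log(looksLikeArxivId("2301.12345")); // true
console.log(looksLikeArxivId("hep-th/9901001v2")); // true
console.log(looksLikeArxivId("quantum computing")); // false
```

Either kind of string is passed to `retriever.invoke` in the same way; the check is only useful on the calling side.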

Use within a chain

Like other retrievers, ArxivRetriever can be incorporated into LLM applications via chains. Below is an example of using the retriever within a chain:

import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import {
  RunnablePassthrough,
  RunnableSequence,
} from "@langchain/core/runnables";
import { StringOutputParser } from "@langchain/core/output_parsers";
import type { Document } from "@langchain/core/documents";

const llm = new ChatOpenAI({
  model: "gpt-4.1-mini",
  temperature: 0,
});

const prompt = ChatPromptTemplate.fromTemplate(`
Answer the question based only on the context provided.

Context: {context}

Question: {question}`);

const formatDocs = (docs: Document[]) => {
  return docs.map((doc) => doc.pageContent).join("\n\n");
};

const ragChain = RunnableSequence.from([
  {
    context: retriever.pipe(formatDocs),
    question: new RunnablePassthrough(),
  },
  prompt,
  llm,
  new StringOutputParser(),
]);

await ragChain.invoke("What are the latest advances in quantum computing?");
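
The `formatDocs` helper above passes only the page content to the prompt. A possible variant, sketched here under the assumption that each document carries a `title` string in its metadata (as in the Usage example), prefixes each entry with its title so the model can tell the papers apart. `DocLike` is a local stand-in type used to keep the sketch self-contained, and `formatDocsWithTitles` is a hypothetical name.

```typescript
// Minimal stand-in for the Document shape used in this sketch.
interface DocLike {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical variant of formatDocs: labels each entry with its title,
// falling back to "Untitled" when the metadata has no string title.
function formatDocsWithTitles(docs: DocLike[]): string {
  return docs
    .map((doc) => {
      const title =
        typeof doc.metadata.title === "string" ? doc.metadata.title : "Untitled";
      return `Title: ${title}\n${doc.pageContent}`;
    })
    .join("\n\n");
}

const sample: DocLike[] = [
  { pageContent: "First summary.", metadata: { title: "Paper A" } },
  { pageContent: "Second summary.", metadata: {} },
];
console.log(formatDocsWithTitles(sample));
```

It can be swapped in for `formatDocs` in the chain, since both take an array of documents and return a single string.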

API reference

For detailed documentation of all ArxivRetriever features and configurations, head to the API reference.

Related

Retrieval guide
