Parallel is a real-time web search and content extraction platform designed specifically for LLMs and AI applications.
The ParallelExtractTool provides access to Parallel’s Extract API, which extracts clean, structured content from web pages.

Overview

Integration details

  • Class: ParallelExtractTool
  • Package: langchain-parallel
  • Serializable: ❌
  • JS support: ❌

Tool features

  • Clean content extraction: Extracts main content from web pages, removing ads, navigation, and boilerplate
  • Markdown formatting: Returns content formatted as clean markdown
  • Batch processing: Extract from multiple URLs in a single API call
  • Metadata extraction: Includes title, publish date, and other metadata
  • Content length control: Configure maximum characters per extraction
  • Error handling: Gracefully handles failed extractions with detailed error information
  • Async support: Full async/await support for better performance

Setup

The integration lives in the langchain-parallel package.
pip install -qU langchain-parallel

Credentials

Head to Parallel to sign up and generate an API key. Once you’ve done this, set the PARALLEL_API_KEY environment variable:
import getpass
import os

if not os.environ.get("PARALLEL_API_KEY"):
    os.environ["PARALLEL_API_KEY"] = getpass.getpass("Parallel API key:\n")

Instantiation

Here we show how to instantiate the ParallelExtractTool. The tool can be configured with an API key and content-length parameters:
from langchain_parallel import ParallelExtractTool

# Basic instantiation - API key from environment
tool = ParallelExtractTool()

# With explicit API key and custom settings
tool = ParallelExtractTool(
    api_key="your-api-key",
    base_url="https://api.parallel.ai",  # default value
    max_chars_per_extract=5000,  # Limit content length
)
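
After instantiation, you can inspect the tool’s standard attributes, including the JSON schema of its input arguments (these come from LangChain’s BaseTool interface):
print(tool.name)  # parallel_extract
print(tool.description)
print(tool.args)  # JSON schema of the accepted input arguments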

Invocation

Invoke directly with args

You can invoke the tool with a list of URLs to extract content from:
# Extract from a single URL
result = tool.invoke(
    {"urls": ["https://en.wikipedia.org/wiki/Artificial_intelligence"]}
)

print(f"Extracted {len(result)} result(s)")
print(f"Title: {result[0]['title']}")
print(f"URL: {result[0]['url']}")
print(f"Content length: {len(result[0]['content'])} characters")
print(f"Content preview: {result[0]['content'][:200]}...")
# Extract from multiple URLs
result = tool.invoke(
    {
        "urls": [
            "https://en.wikipedia.org/wiki/Machine_learning",
            "https://en.wikipedia.org/wiki/Deep_learning",
            "https://en.wikipedia.org/wiki/Natural_language_processing",
        ]
    }
)

print(f"Extracted {len(result)} results")
for i, item in enumerate(result, 1):
    print(f"\n{i}. {item['title']}")
    print(f"   URL: {item['url']}")
    print(f"   Content length: {len(item['content'])} characters")

# Example response structure:
# [
#     {
#         "url": "https://example.com/article",
#         "title": "Article Title",
#         "content": "# Article Title\n\nMain content in markdown...",
#         "publish_date": "2024-01-15"  # Optional
#     }
# ]

Invoke with ToolCall

We can also invoke the tool with a model-generated ToolCall, in which case a ToolMessage will be returned:
# This is usually generated by a model, but we'll create a tool call directly for demo purposes.
model_generated_tool_call = {
    "args": {
        "urls": [
            "https://en.wikipedia.org/wiki/Climate_change",
            "https://en.wikipedia.org/wiki/Renewable_energy",
        ]
    },
    "id": "call_123",
    "name": tool.name,  # "parallel_extract"
    "type": "tool_call",
}

result = tool.invoke(model_generated_tool_call)
print(result)
print(f"Tool name: {tool.name}")  # parallel_extract
print(f"Tool description: {tool.description}")

Async usage

The tool supports full async/await operations for better performance in async applications:
async def extract_async():
    return await tool.ainvoke(
        {
            "urls": [
                "https://en.wikipedia.org/wiki/Python_(programming_language)",
                "https://en.wikipedia.org/wiki/JavaScript",
            ]
        }
    )


# Run the async extraction (top-level await works in notebooks; use asyncio.run(extract_async()) in a script)
result = await extract_async()
print(f"Extracted {len(result)} results asynchronously")

Advanced features

The extract tool supports focused extraction with search objectives/queries, fetch policies, and fine-grained control over excerpts and full content:
# Extract focused excerpts with search objective
result = tool.invoke(
    {
        "urls": ["https://en.wikipedia.org/wiki/Artificial_intelligence"],
        "search_objective": "What are the main applications and ethical concerns of AI?",
        "excerpts": {"max_chars_per_result": 2000},
        "full_content": False,
    }
)

print(f"Extracted focused excerpts: {len(result[0].get('excerpts', []))} excerpts")
print(f"Content preview: {result[0]['content'][:200]}...")

# Extract with fetch policy for fresh content
result = tool.invoke(
    {
        "urls": ["https://en.wikipedia.org/wiki/Quantum_computing"],
        "fetch_policy": {
            "max_age_seconds": 86400,  # 1 day cache
            "timeout_seconds": 60,
            "disable_cache_fallback": False,
        },
        "full_content": {"max_chars_per_result": 5000},
    }
)

print(f"Content length: {len(result[0]['content'])} characters")

Error handling

The tool gracefully handles URLs that fail to extract, including them in results with error information:
# Mix of valid and invalid URLs
result = tool.invoke(
    {
        "urls": [
            "https://en.wikipedia.org/wiki/Artificial_intelligence",
            "https://this-domain-does-not-exist-12345.com/",
        ]
    }
)

for item in result:
    if "error_type" in item:
        print(f"Failed: {item['url']}")
        print(f"Error: {item['content']}")
    else:
        print(f"Success: {item['url']}")
        print(f"Extracted {len(item['content'])} characters")

Best practices

  • Batch URLs: Extract multiple URLs in a single call for better performance
  • Set content limits: Use max_chars_per_extract to control response size and token usage
  • Handle errors: Check for error_type in results to identify failed extractions (see the sketch after this list)
  • Use async for performance: Use ainvoke() in async applications for better performance
  • Metadata fields: Use publish_date and other metadata when available for context
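
Putting several of these practices together, here is a minimal sketch that batches URLs in one call, caps content length, and partitions the results by error status (extract_and_partition is an illustrative helper, not part of the package):
from langchain_parallel import ParallelExtractTool

tool = ParallelExtractTool(max_chars_per_extract=3000)  # cap response size and token usage


def extract_and_partition(urls: list[str]) -> tuple[list[dict], list[dict]]:
    """Extract a batch of URLs in a single call and split successes from failures."""
    results = tool.invoke({"urls": urls})
    successes = [r for r in results if "error_type" not in r]
    failures = [r for r in results if "error_type" in r]
    return successes, failures


ok, failed = extract_and_partition(
    [
        "https://en.wikipedia.org/wiki/Artificial_intelligence",
        "https://this-domain-does-not-exist-12345.com/",
    ]
)
print(f"{len(ok)} succeeded, {len(failed)} failed")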

Response format

The tool returns a list of dictionaries with the following format:
[
    {
        "url": "https://example.com/article",
        "title": "Article Title",
        "content": "# Article Title\n\nMain content formatted as markdown...",
        "publish_date": "2024-01-15"  # Optional, if available
    },
    # For failed extractions:
    {
        "url": "https://failed-site.com",
        "title": None,
        "content": "Error: 404 Not Found",
        "error_type": "http_error"
    }
]
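
If you want static typing over this shape, a TypedDict sketch is straightforward (the type names here are illustrative, not part of the package):
from typing import Optional, TypedDict


class _ExtractResultBase(TypedDict):
    url: str
    content: str  # markdown content, or an error message for failures


class ExtractResult(_ExtractResultBase, total=False):
    title: Optional[str]  # None for failed extractions
    publish_date: str  # only present when available
    error_type: str  # only present for failed extractions


def is_failure(item: ExtractResult) -> bool:
    return "error_type" in item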

API reference

For detailed documentation of all features and configuration options, head to the ParallelExtractTool API reference or the Parallel extract reference.