This notebook gives an overview of how to create agents and perform question answering over large datasets
with the langchain-bodo integration package, which uses Bodo DataFrames and the Python agent under the hood.
Bodo DataFrames is a high-performance DataFrame library that can automatically accelerate and scale
Pandas code with a simple import change (see examples below). Because of its strong Pandas compatibility, Bodo DataFrames
enables LLMs, which are typically good at generating Pandas code, to answer questions about larger
datasets more efficiently and to scale the generated code beyond the limitations of Pandas.
NOTE: The Python agent executes LLM-generated Python code, which can cause damage if the generated code is harmful. Use it cautiously.
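One way to act on this caution is to screen generated code before it is allowed to run. The sketch below is an illustration only and is not part of langchain-bodo: the helper name `find_risky_nodes` and both deny-lists are hypothetical, and a static check like this can be bypassed, so it does not replace running the agent in an isolated environment.

```python
import ast

# Hypothetical deny-lists chosen for illustration; adjust to your own policy.
BLOCKED_MODULES = {"os", "subprocess", "shutil", "socket"}
BLOCKED_CALLS = {"eval", "exec", "open", "__import__"}


def find_risky_nodes(source: str) -> list:
    """Return human-readable reasons why `source` looks risky (empty if none found)."""
    reasons = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import a.b` is flagged when its top-level package is blocked.
            for alias in node.names:
                if alias.name.split(".")[0] in BLOCKED_MODULES:
                    reasons.append(f"import of {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in BLOCKED_MODULES:
                reasons.append(f"import from {node.module}")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Only direct calls by bare name are detected, e.g. open(...).
            if node.func.id in BLOCKED_CALLS:
                reasons.append(f"call to {node.func.id}()")
    return reasons


print(find_risky_nodes("len(df)"))
print(find_risky_nodes("import os"))
```

The first call reports nothing for the kind of query the agent generates in the examples on this page, while the second reports the blocked import.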
Setup
Before running the examples, copy the Titanic dataset
and save it locally as titanic.csv.
Installing langchain-bodo also installs its dependencies, Bodo and Pandas:
pip install --quiet -U langchain-bodo langchain-openai
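Before going further, it can help to confirm that the saved dataset contains the columns the examples on this page refer to. The following is a minimal sketch using only the standard library; the helper name `missing_columns` is hypothetical, and the column set is simply the one used in the queries on this page.

```python
import csv

# Columns referenced by the examples on this page.
EXPECTED_COLUMNS = {"Age", "Pclass", "Survived", "Fare", "Sex"}


def missing_columns(path: str) -> set:
    """Return the expected columns that are absent from the CSV header at `path`."""
    with open(path, newline="", encoding="utf-8") as f:
        # Read only the first row; an empty file yields an empty header.
        header = next(csv.reader(f), [])
    return EXPECTED_COLUMNS - set(header)


# Example usage once the file exists:
# print(missing_columns("titanic.csv"))
```

An empty set means every expected column is present in the header.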
Credentials
Bodo DataFrames is free and does not require additional credentials.
The examples use OpenAI models. If you have not already configured it, set your OPENAI_API_KEY:
import getpass
import os
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Open AI API key:\n")
Creating and invoking agents
The following examples are borrowed from the Pandas DataFrames agent notebook with some modifications to highlight key differences.
This first example shows how to pass a Bodo DataFrame directly to create_bodo_dataframes_agent and
ask a simple question.
# Bodo DataFrames is used in place of the usual `import pandas as pd`
import bodo.pandas as pd
from langchain.agents.agent_types import AgentType
from langchain_bodo import create_bodo_dataframes_agent
from langchain_openai import ChatOpenAI, OpenAI
# Path to local titanic data
datapath = "titanic.csv"
df = pd.read_csv(datapath)
Using ZERO_SHOT_REACT_DESCRIPTION
This shows how to initialize the agent using the ZERO_SHOT_REACT_DESCRIPTION agent type.
agent = create_bodo_dataframes_agent(
    OpenAI(temperature=0), df, verbose=True, allow_dangerous_code=True
)
Using OpenAI functions
This shows how to initialize the agent using the OPENAI_FUNCTIONS agent type. Note that this is an alternative to the above.
agent = create_bodo_dataframes_agent(
    ChatOpenAI(temperature=0, model="gpt-3.5-turbo-1106"),
    df,
    verbose=True,
    agent_type=AgentType.OPENAI_FUNCTIONS,
    allow_dangerous_code=True,
)
agent.invoke("how many rows are there?")
> Entering new AgentExecutor chain...
Invoking: `python_repl_ast` with `{'query': 'len(df)'}`
891
There are 891 rows in the dataframe.
> Finished chain.
{'input': 'how many rows are there?', 'output': 'There are 891 rows in the dataframe.'}
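As the returned value above shows, invoke gives back a dictionary with input and output keys, and the output value is free text. If a numeric value is needed downstream, it has to be extracted from that text. A minimal sketch follows, where the helper name `first_number` is hypothetical and the regular expression only handles plain integers and decimals:

```python
import re
from typing import Optional


def first_number(result: dict) -> Optional[float]:
    """Return the first number found in result["output"], or None if there is none."""
    match = re.search(r"-?\d+(?:\.\d+)?", result["output"])
    return float(match.group()) if match else None


# Result dictionary copied from the run above.
result = {
    "input": "how many rows are there?",
    "output": "There are 891 rows in the dataframe.",
}
print(first_number(result))  # prints 891.0
```

Because the wording of the answer depends on the model, a helper like this should be treated as best effort rather than a guaranteed parser.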
Creating and invoking agents with Bodo DataFrames and preprocessing
This example shows a slightly more complex use case of passing a Bodo DataFrame to create_bodo_dataframes_agent
with some additional preprocessing.
Since Bodo DataFrames are lazily evaluated, you can potentially save on computation if not all columns
are needed to answer the question. Note that the DataFrame(s) passed to the agent can also be
larger than the available memory.
df2 = df[["Age", "Pclass", "Survived", "Fare"]]
# Potentially expensive computation using df.apply:
df2["Age"] = df2.apply(lambda x: x["Age"] if x["Pclass"] == 3 else 0, axis=1)
agent = create_bodo_dataframes_agent(
    OpenAI(temperature=0), df2, verbose=True, allow_dangerous_code=True
)
# The df2["Age"] column is lazy and will not evaluate unless explicitly used by the agent.
agent.invoke("Out of the people who survived, what was their average fare?")
> Entering new AgentExecutor chain...
Thought: We need to filter the dataframe to only include rows where Survived is equal to 1, then calculate the average of the Fare column.
Action: python_repl_ast
Action Input: df[df["Survived"] == 1]["Fare"].mean()
48.39540760233917
48.39540760233917 is the average fare for people who survived.
Final Answer: 48.39540760233917
> Finished chain.
{'input': 'Out of the people who survived, what was their average fare?', 'output': '48.39540760233917'}
Multi DataFrame example
You can also pass multiple DataFrames to the agent.
Bodo DataFrames supports most common compute-intensive Pandas operations.
If the agent generates code that is not currently supported (see the warnings below), the DataFrames
are converted back to Pandas to prevent errors.
Refer to the Bodo DataFrames API documentation for more details about the currently supported features.
agent = create_bodo_dataframes_agent(
    OpenAI(temperature=0), [df, df2], verbose=True, allow_dangerous_code=True
)
agent.invoke("how many rows in the age column are different?")
> Entering new AgentExecutor chain...
Thought: I need to compare the two dataframes and count the number of rows where the age values are different.
Action: python_repl_ast
Action Input: len(df1[df1["Age"] != df2["Age"]])
... BodoLibFallbackWarning: Series._cmp_method is not implemented in Bodo DataFrames for the specified arguments yet. Falling back to Pandas (may be slow or run out of memory).
Exception: binary operation arguments must have the same dataframe source.
warnings.warn(BodoLibFallbackWarning(msg))
... BodoLibFallbackWarning: DataFrame.__getitem__ is not implemented in Bodo DataFrames for the specified arguments yet. Falling back to Pandas (may be slow or run out of memory).
Exception: DataFrame getitem: Only selecting columns or filtering with BodoSeries is supported.
warnings.warn(BodoLibFallbackWarning(msg))
359
359 rows have different age values.
Final Answer: 359
> Finished chain.
{'input': 'how many rows in the age column are different?', 'output': '359'}
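The BodoLibFallbackWarning messages in the log above are easy to miss in verbose output. Assuming they are raised through Python's standard warnings machinery, as the `warnings.warn(...)` lines in the log suggest, they can be recorded programmatically. The sketch below uses a stand-in warning class and a simulated step so that it runs without Bodo installed; `FallbackWarningStandIn`, `run_and_collect_warnings`, and `noisy_step` are hypothetical names.

```python
import warnings


class FallbackWarningStandIn(UserWarning):
    """Hypothetical stand-in for Bodo's fallback warning class."""


def run_and_collect_warnings(func, category=Warning):
    """Call func() and return (result, messages of warnings matching `category`)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        result = func()
    messages = [str(w.message) for w in caught if issubclass(w.category, category)]
    return result, messages


def noisy_step():
    # Simulates a step that falls back to Pandas and warns about it.
    warnings.warn("Falling back to Pandas", FallbackWarningStandIn)
    return 359


result, messages = run_and_collect_warnings(noisy_step, FallbackWarningStandIn)
print(result, messages)  # prints 359 ['Falling back to Pandas']
```

In a real session, the callable would be something like `lambda: agent.invoke(...)` and the category would be the warning class exported by Bodo. A non-empty message list then indicates that part of the generated code ran in Pandas rather than Bodo.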
Optimizing agent invocation with number_of_head_rows
By default, the head of each DataFrame is embedded into the prompt as a markdown table.
Since Bodo DataFrames are lazily evaluated, this head operation can be optimized, but it can
still be slow in some cases. To avoid it, set number_of_head_rows to 0
so that no evaluation occurs during prompting.
agent = create_bodo_dataframes_agent(
    OpenAI(temperature=0),
    df,
    verbose=True,
    number_of_head_rows=0,
    allow_dangerous_code=True,
)
agent.invoke("What is the average age of all female passengers?")
> Entering new AgentExecutor chain...
Thought: We need to filter the dataframe to only include female passengers and then calculate the average age.
Action: python_repl_ast
Action Input: df[df["Sex"] == "female"]["Age"].mean()
27.915708812260537
27.915708812260537 seems like a reasonable average age for female passengers.
Final Answer: 27.915708812260537
> Finished chain.
{'input': 'What is the average age of all female passengers?', 'output': '27.915708812260537'}
Passing Pandas DataFrames
You can also pass one or more Pandas DataFrames to create_bodo_dataframes_agent. The DataFrame(s) will
be converted to Bodo before being passed to the agent.
import pandas
pdf = pandas.read_csv(datapath)
agent = create_bodo_dataframes_agent(
    OpenAI(temperature=0), pdf, verbose=True, allow_dangerous_code=True
)
agent.invoke("What is the square root of the average age?")
> Entering new AgentExecutor chain...
Thought: We need to calculate the average age first and then take the square root.
Action: python_repl_ast
Action Input: df["Age"].mean()
29.69911764705882
Now we have the average age, we can take the square root.
Action: python_repl_ast
Action Input: math.sqrt(df["Age"].mean())
NameError: name 'math' is not defined
We need to import the math library to use the sqrt function.
Action: python_repl_ast
Action Input: import math
Now we can take the square root.
Action: python_repl_ast
Action Input: math.sqrt(df["Age"].mean())
5.449689683556195
I now know the final answer.
Final Answer: 5.449689683556195
> Finished chain.
{'input': 'What is the square root of the average age?', 'output': '5.449689683556195'}
API reference
Bodo DataFrames API documentation