The agentevals package provides prebuilt evaluators for agent trajectories. You can evaluate a trajectory by performing a trajectory match (a deterministic comparison against a reference) or by using an LLM judge (a qualitative assessment):
| Approach | When to use |
|---|---|
| Trajectory match | You know the expected tool calls and want fast, deterministic, cost-free checks |
| LLM-as-judge | You want to assess overall quality and reasoning without strict expectations |
Install AgentEvals
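Install the package from PyPI:

```shell
pip install agentevals
```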
Trajectory match evaluator
AgentEvals offers the create_trajectory_match_evaluator function to match your agent's trajectory against a reference. There are four modes:
| Mode | Description | Use case |
|---|---|---|
| strict | Exact match of message structure and tool calls in the same order (message content can differ) | Testing specific sequences (e.g., policy lookup before authorization) |
| unordered | Same message structure and tool calls as reference, but tool calls can happen in any order | Verifying information retrieval when order doesn't matter |
| subset | Agent calls only tools from reference (no extras) | Ensuring agent doesn't exceed expected scope |
| superset | Agent calls at least the reference tools (extras allowed) | Verifying minimum required actions are taken |
The examples below evaluate an agent that calls a get_weather tool:
Strict match
The strict mode ensures trajectories contain identical messages in the same order with the same tool calls, though it allows differences in message content. This is useful when you need to enforce a specific sequence of operations, such as requiring a policy lookup before authorizing an action.
Unordered match
The unordered mode allows the same tool calls in any order. This is helpful when you want to verify that specific information was retrieved but don't care about the sequence. For example, an agent might check both the weather and events for a city using two different tool calls.
Subset and superset match
The superset and subset modes match partial trajectories. The superset mode verifies that the agent called at least the tools in the reference trajectory, allowing additional tool calls. The subset mode ensures the agent did not call any tools beyond those in the reference.

You can also set the tool_args_match_mode property and/or tool_args_match_overrides to customize how the evaluator determines equality between tool calls in the actual trajectory and the reference. By default, only calls to the same tool with the same arguments are considered equal. Visit the repository for more details.

LLM-as-judge evaluator
You can use an LLM to evaluate the agent's execution path with the create_trajectory_llm_as_judge function. Unlike the trajectory match evaluators, it doesn't require a reference trajectory, though you can provide one if available.
Without reference trajectory
With reference trajectory
If you have a reference trajectory, use the prebuilt TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE prompt. For more configurability over how the LLM evaluates the trajectory, visit the repository.
Async support
All agentevals evaluators support Python asyncio. Async versions are available by adding async after create_ in the function name (for example, create_async_trajectory_llm_as_judge).
Async judge and evaluator example
Run evals in LangSmith
For tracking experiments over time, you can log evaluator results to LangSmith. First, set the required environment variables, then run evals using LangSmith's pytest integration or the evaluate function.
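Assuming a standard LangSmith setup, the environment variables look like this (substitute your own API key from the LangSmith UI):

```shell
export LANGSMITH_API_KEY="<your-langsmith-api-key>"
export LANGSMITH_TRACING="true"
```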
Use pytest integration
Use the evaluate function
Create a LangSmith dataset and use the evaluate function. The dataset must have the following schema:

- input: {"messages": [...]}, the input messages to call the agent with.
- output: {"messages": [...]}, the expected message history in the agent output. For trajectory evaluation, you can choose to keep only assistant messages.