The agentevals package provides evaluators specifically designed for testing agent trajectories with live models.
This guide covers the open source LangChain agentevals package, which integrates with LangSmith for trajectory evaluation.
Trajectory match
Hard-code a reference trajectory for a given input and validate the run via a step-by-step comparison. Ideal for testing well-defined workflows where you know the expected behavior. Use when you have specific expectations about which tools should be called and in what order. This approach is deterministic, fast, and cost-effective since it doesn’t require additional LLM calls.
LLM-as-judge
Use an LLM to qualitatively validate your agent’s execution trajectory. The “judge” LLM reviews the agent’s decisions against a prompt rubric (which can include a reference trajectory). More flexible and able to assess nuanced aspects like efficiency and appropriateness, but requires an LLM call and is less deterministic. Use when you want to evaluate the overall quality and reasonableness of the agent’s trajectory without strict tool call or ordering requirements.
Installing AgentEvals
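Install the package from PyPI (an OpenAI API key is only needed later, for the LLM-as-judge examples):

```bash
pip install agentevals
```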
Trajectory match evaluator
AgentEvals offers the create_trajectory_match_evaluator function in Python and createTrajectoryMatchEvaluator in TypeScript to match your agent’s trajectory against a reference trajectory.
You can use the following modes:
| Mode | Description | Use Case |
|---|---|---|
| `strict` | Exact match of messages and tool calls in the same order | Testing specific sequences (e.g., policy lookup before authorization) |
| `unordered` | Same tool calls allowed in any order | Verifying information retrieval when order doesn’t matter |
| `subset` | Agent calls only tools from reference (no extras) | Ensuring agent doesn’t exceed expected scope |
| `superset` | Agent calls at least the reference tools (extras allowed) | Verifying minimum required actions are taken |
Strict match
The strict mode ensures trajectories contain identical messages in the same order with the same tool calls, though it allows for differences in message content. This is useful when you need to enforce a specific sequence of operations, such as requiring a policy lookup before authorizing an action.
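A minimal Python sketch, assuming OpenAI-style message dicts and a hypothetical get_weather tool (the trajectories shown are illustrative):

```python
import json

from agentevals.trajectory.match import create_trajectory_match_evaluator

evaluator = create_trajectory_match_evaluator(
    trajectory_match_mode="strict",
)

# The agent's actual trajectory as OpenAI-style messages with tool calls.
outputs = [
    {"role": "user", "content": "What is the weather in SF?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {"function": {"name": "get_weather", "arguments": json.dumps({"city": "SF"})}}
        ],
    },
    {"role": "tool", "content": "It's 80 degrees and sunny in SF."},
    {"role": "assistant", "content": "The weather in SF is 80 degrees and sunny."},
]

# The hard-coded reference trajectory: same messages, same tool calls, same order.
reference_outputs = [
    {"role": "user", "content": "What is the weather in SF?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {"function": {"name": "get_weather", "arguments": json.dumps({"city": "SF"})}}
        ],
    },
    {"role": "tool", "content": "It's 80 degrees and sunny in SF."},
    {"role": "assistant", "content": "The weather in SF is 80 degrees and sunny."},
]

result = evaluator(outputs=outputs, reference_outputs=reference_outputs)
# result is a dict containing a key, a boolean score, and an optional comment.
```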
Unordered match
The unordered mode allows the same tool calls in any order, which is helpful when you want to verify that the correct set of tools are being invoked but don’t care about the sequence. For example, an agent might need to check both weather and events for a city, but the order doesn’t matter.
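A sketch under the same assumptions as the strict example, where the agent checks events before weather while the reference lists weather first:

```python
import json

from agentevals.trajectory.match import create_trajectory_match_evaluator

evaluator = create_trajectory_match_evaluator(trajectory_match_mode="unordered")

def tool_call(name: str, **args) -> dict:
    # Small helper (not part of agentevals) to build an OpenAI-style tool call.
    return {"function": {"name": name, "arguments": json.dumps(args)}}

# Agent checked events first, then weather.
outputs = [
    {"role": "user", "content": "What's happening in SF this weekend?"},
    {"role": "assistant", "content": "", "tool_calls": [tool_call("get_city_events", city="SF")]},
    {"role": "tool", "content": "There is a street fair on Saturday."},
    {"role": "assistant", "content": "", "tool_calls": [tool_call("get_weather", city="SF")]},
    {"role": "tool", "content": "It's 80 degrees and sunny in SF."},
    {"role": "assistant", "content": "Expect sun and a street fair on Saturday."},
]

# The reference lists the same tools in the opposite order.
reference_outputs = [
    {"role": "user", "content": "What's happening in SF this weekend?"},
    {"role": "assistant", "content": "", "tool_calls": [tool_call("get_weather", city="SF")]},
    {"role": "tool", "content": "It's 80 degrees and sunny in SF."},
    {"role": "assistant", "content": "", "tool_calls": [tool_call("get_city_events", city="SF")]},
    {"role": "tool", "content": "There is a street fair on Saturday."},
    {"role": "assistant", "content": "Expect sun and a street fair on Saturday."},
]

# Passes: the same set of tools was called, regardless of order.
result = evaluator(outputs=outputs, reference_outputs=reference_outputs)
```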
Subset and superset match
The superset and subset modes focus on which tools are called rather than the order of tool calls, allowing you to control how strictly the agent’s tool calls must align with the reference.
- Use superset mode when you want to verify that a few key tools are called in the execution, but you’re okay with the agent calling additional tools. The agent’s trajectory must include at least all the tool calls in the reference trajectory, and may include additional tool calls beyond the reference.
- Use subset mode to ensure agent efficiency by verifying that the agent did not call any irrelevant or unnecessary tools beyond those in the reference. The agent’s trajectory must include only tool calls that appear in the reference trajectory.
The following example demonstrates superset mode, where the reference trajectory only requires the get_weather tool, but the agent can call additional tools:
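A sketch of that scenario, again assuming OpenAI-style message dicts and illustrative get_weather / get_city_events tools:

```python
import json

from agentevals.trajectory.match import create_trajectory_match_evaluator

evaluator = create_trajectory_match_evaluator(trajectory_match_mode="superset")

# Agent called get_weather (required) plus an extra get_city_events call.
outputs = [
    {"role": "user", "content": "What is the weather in SF?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {"function": {"name": "get_weather", "arguments": json.dumps({"city": "SF"})}},
            {"function": {"name": "get_city_events", "arguments": json.dumps({"city": "SF"})}},
        ],
    },
    {"role": "tool", "content": "It's 80 degrees and sunny in SF."},
    {"role": "assistant", "content": "The weather in SF is 80 degrees and sunny."},
]

# The reference only requires get_weather; extra calls are allowed in superset mode.
reference_outputs = [
    {"role": "user", "content": "What is the weather in SF?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {"function": {"name": "get_weather", "arguments": json.dumps({"city": "SF"})}}
        ],
    },
    {"role": "tool", "content": "It's 80 degrees and sunny in SF."},
    {"role": "assistant", "content": "The weather in SF is 80 degrees and sunny."},
]

result = evaluator(outputs=outputs, reference_outputs=reference_outputs)
```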
You can also customize how the evaluator considers equality between tool calls in the actual trajectory vs. the reference by setting the tool_args_match_mode (Python) or toolArgsMatchMode (TypeScript) property, as well as the tool_args_match_overrides (Python) or toolArgsMatchOverrides (TypeScript) property. By default, only tool calls with the same arguments to the same tool are considered equal. Visit the repository for more details.
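As a rough sketch, assuming the argument-matching modes documented in the repository (for example an "ignore" mode), the configuration might look like this:

```python
from agentevals.trajectory.match import create_trajectory_match_evaluator

# Sketch: treat any two calls to the same tool as equal, ignoring their
# arguments entirely. See the agentevals repository for the full list of
# supported modes and per-tool overrides.
evaluator = create_trajectory_match_evaluator(
    trajectory_match_mode="unordered",
    tool_args_match_mode="ignore",
)
```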
LLM-as-judge evaluator
This section covers the trajectory-specific LLM-as-a-judge evaluator from the agentevals package. For general-purpose LLM-as-a-judge evaluators in LangSmith, refer to the LLM-as-a-judge evaluator.
Without reference trajectory
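A minimal sketch using the prebuilt TRAJECTORY_ACCURACY_PROMPT; the model identifier and trajectory below are illustrative, and an OpenAI API key is assumed to be set:

```python
from agentevals.trajectory.llm import (
    TRAJECTORY_ACCURACY_PROMPT,
    create_trajectory_llm_as_judge,
)

trajectory_evaluator = create_trajectory_llm_as_judge(
    prompt=TRAJECTORY_ACCURACY_PROMPT,
    model="openai:o3-mini",  # illustrative model identifier
)

outputs = [
    {"role": "user", "content": "What is the weather in SF?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {"function": {"name": "get_weather", "arguments": '{"city": "SF"}'}}
        ],
    },
    {"role": "tool", "content": "It's 80 degrees and sunny in SF."},
    {"role": "assistant", "content": "The weather in SF is 80 degrees and sunny."},
]

# No reference trajectory is needed; the judge scores the trajectory on its own.
result = trajectory_evaluator(outputs=outputs)
```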
With reference trajectory
If you have a reference trajectory, you can add an extra variable to your prompt and pass in the reference trajectory. Below, we use the prebuilt TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE prompt and configure the reference_outputs variable:
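A sketch, assuming outputs and reference_outputs are message lists built as in the trajectory match examples above:

```python
from agentevals.trajectory.llm import (
    TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE,
    create_trajectory_llm_as_judge,
)

judge = create_trajectory_llm_as_judge(
    prompt=TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE,
    model="openai:o3-mini",  # illustrative model identifier
)

# The reference trajectory is passed alongside the agent's actual trajectory
# and is interpolated into the prompt's reference_outputs variable.
result = judge(outputs=outputs, reference_outputs=reference_outputs)
```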
For more configurability over how the LLM evaluates the trajectory, visit the repository.
Async support (Python)
All agentevals evaluators support Python asyncio. For evaluators that use factory functions, async versions are available by adding async after create_ in the function name.
Here’s an example using the async judge and evaluator:
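A sketch following that naming convention (the model identifier and trajectory are illustrative):

```python
import asyncio

from agentevals.trajectory.llm import (
    TRAJECTORY_ACCURACY_PROMPT,
    create_async_trajectory_llm_as_judge,
)

async_evaluator = create_async_trajectory_llm_as_judge(
    prompt=TRAJECTORY_ACCURACY_PROMPT,
    model="openai:o3-mini",  # illustrative model identifier
)

async def run_evaluation() -> dict:
    outputs = [
        {"role": "user", "content": "What is the weather in SF?"},
        {"role": "assistant", "content": "The weather in SF is 80 degrees and sunny."},
    ]
    # Async evaluators are awaited; otherwise usage mirrors the sync versions.
    return await async_evaluator(outputs=outputs)

result = asyncio.run(run_evaluation())
```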