AgentEvals package
Create evaluator
A common way to evaluate agent performance is to compare its trajectory (the order in which it calls its tools) against a reference trajectory. When creating a trajectory match evaluator, you specify how the trajectories will be compared.
`superset` accepts the output trajectory as valid if it is a superset of the reference one. Other options include `strict`, `unordered`, and `subset`.
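The modes above differ only in how the two tool-call sequences are compared. The helper below is a minimal, library-free sketch of the `superset` check, not AgentEvals' own implementation; the message shapes and tool names are made up for illustration:

```python
from collections import Counter

def tool_calls(messages):
    """Extract the names of tools called across a list of assistant messages."""
    return [
        call["name"]
        for msg in messages
        if msg.get("role") == "assistant"
        for call in msg.get("tool_calls", [])
    ]

def superset_match(outputs, reference_outputs):
    """True if the output trajectory makes every tool call the reference
    trajectory makes (extra calls are allowed) -- the 'superset' idea."""
    out = Counter(tool_calls(outputs))
    ref = Counter(tool_calls(reference_outputs))
    return all(out[name] >= count for name, count in ref.items())

reference = [{"role": "assistant", "tool_calls": [{"name": "get_weather"}]}]
output = [{"role": "assistant",
           "tool_calls": [{"name": "get_weather"}, {"name": "get_news"}]}]
print(superset_match(output, reference))  # extra get_news call is allowed -> True
print(superset_match(reference, output))  # reference lacks get_news -> False
```

`strict` would additionally require the sequences to match exactly, `unordered` requires the same calls in any order, and `subset` inverts the superset check.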
LLM-as-a-judge
You can use an LLM-as-a-judge evaluator, which uses an LLM to compare the trajectory against the reference outputs and output a score.
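A library-free sketch of the idea: format both trajectories into a judge prompt, send it to a model, and turn the reply into a score. The prompt wording and the result shape here are assumptions, not AgentEvals' actual prompts, and `llm` is any callable that takes a prompt string and returns the model's text reply:

```python
import json

# Hypothetical judge prompt; the prebuilt AgentEvals prompts differ.
JUDGE_PROMPT = """Compare the agent's trajectory against the reference.

Reference trajectory:
{reference}

Agent trajectory:
{outputs}

Answer with exactly "true" if the agent's trajectory is accurate, else "false"."""

def judge_trajectory(outputs, reference_outputs, llm):
    """Score a trajectory with an LLM judge."""
    prompt = JUDGE_PROMPT.format(
        reference=json.dumps(reference_outputs, indent=2),
        outputs=json.dumps(outputs, indent=2),
    )
    verdict = llm(prompt).strip().lower()
    return {"key": "trajectory_accuracy", "score": verdict == "true"}

# Usage with a stub standing in for a real model call:
stub_llm = lambda prompt: "true"
result = judge_trajectory(
    [{"role": "assistant", "content": "done"}],
    [{"role": "assistant", "content": "done"}],
    stub_llm,
)
print(result)  # {'key': 'trajectory_accuracy', 'score': True}
```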
Run evaluator
To run an evaluator, you first need to create a LangSmith dataset. To use the prebuilt AgentEvals evaluators, the dataset must have the following schema:
- input: {"messages": [...]}, the input messages to call the agent with.
- output: {"messages": [...]}, the expected message history in the agent output. For trajectory evaluation, you can choose to keep only the assistant messages.
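A sketch of a single dataset example in that schema. The message contents and tool name are invented for illustration; only the input/output structure follows the schema above:

```python
# One dataset example: an input message list for the agent, and the
# expected (reference) message history it should produce.
example = {
    "input": {"messages": [
        {"role": "user", "content": "What is the weather in SF?"},
    ]},
    "output": {"messages": [
        # Expected trajectory: for trajectory evaluation, keeping only
        # the assistant messages is enough.
        {"role": "assistant", "tool_calls": [{"name": "get_weather"}]},
        {"role": "assistant", "content": "It is sunny in SF."},
    ]},
}
print(sorted(example))  # ['input', 'output']
```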