Evaluate agent performance

To evaluate your agent's performance, you can use LangSmith evaluations. First, define an evaluator function that judges the results from an agent, such as its final outputs or its trajectory. Depending on your evaluation technique, this may or may not involve a reference output:

type EvaluatorParams = {
  outputs: Record<string, any>;
  referenceOutputs: Record<string, any>;
};

function evaluator({ outputs, referenceOutputs }: EvaluatorParams) {
  // Compare agent outputs against reference outputs.
  // compareMessages is a placeholder for your own comparison logic.
  const outputMessages = outputs.messages;
  const referenceMessages = referenceOutputs.messages;
  const score = compareMessages(outputMessages, referenceMessages);
  return { key: "evaluator_score", score };
}
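
The `compareMessages` helper in the snippet above is not defined in this guide. As a minimal sketch, assuming messages carry OpenAI-style `tool_calls`, it could score the fraction of reference tool names that the agent also called. The `Message` and `ToolCall` types and the `toolNames` helper are hypothetical names introduced only for illustration:

```typescript
// Hypothetical message shape, loosely following the OpenAI-style messages used in this guide.
type ToolCall = { function: { name: string; arguments: string } };
type Message = { role: string; content?: string; tool_calls?: ToolCall[] };

// Collect the names of all tool calls, in the order they appear.
function toolNames(messages: Message[]): string[] {
  const names: string[] = [];
  for (const message of messages) {
    for (const call of message.tool_calls ?? []) {
      names.push(call.function.name);
    }
  }
  return names;
}

// Hypothetical helper: fraction of reference tool names that the agent also called.
// Call order and arguments are ignored in this sketch.
function compareMessages(
  outputMessages: Message[],
  referenceMessages: Message[]
): number {
  const expected = toolNames(referenceMessages);
  if (expected.length === 0) return 1; // the reference requires no tool calls
  const called = new Set(toolNames(outputMessages));
  const hits = expected.filter((name) => called.has(name)).length;
  return hits / expected.length;
}
```

A score of 1 means every reference tool was called at least once. A stricter evaluator would also compare arguments and ordering.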

To get started, you can use the prebuilt evaluators from the AgentEvals package:

npm install agentevals

Create evaluator

A common way to evaluate agent performance is by comparing its trajectory (the order in which it calls its tools) against a reference trajectory:

import { createTrajectoryMatchEvaluator } from "agentevals/trajectory/match";

const outputs = [
  {
    role: "assistant",
    tool_calls: [
      {
        function: {
          name: "get_weather",
          arguments: JSON.stringify({ city: "san francisco" }),
        },
      },
      {
        function: {
          name: "get_directions",
          arguments: JSON.stringify({ destination: "presidio" }),
        },
      },
    ],
  },
];

const referenceOutputs = [
  {
    role: "assistant",
    tool_calls: [
      {
        function: {
          name: "get_weather",
          arguments: JSON.stringify({ city: "san francisco" }),
        },
      },
    ],
  },
];

// Create the evaluator
const evaluator = createTrajectoryMatchEvaluator({
  // "superset" accepts the output trajectory if it is a superset of the reference trajectory
  trajectoryMatchMode: "superset",
});

// Run the evaluator
const result = await evaluator({
  outputs,
  referenceOutputs,
});

The trajectoryMatchMode option specifies how the trajectories are compared. superset accepts the output trajectory as valid if it is a superset of the reference trajectory. Other options are strict, unordered, and subset.
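
To make the four match modes concrete, here is a simplified sketch that compares only the lists of tool names. It is not the AgentEvals implementation, which also considers messages and tool arguments. The `matchToolNames`, `countNames`, and `containsAll` names are hypothetical:

```typescript
type MatchMode = "strict" | "unordered" | "subset" | "superset";

// Count how many times each tool name occurs.
function countNames(names: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const name of names) {
    counts.set(name, (counts.get(name) ?? 0) + 1);
  }
  return counts;
}

// True if `a` contains every name in `b` at least as many times as `b` does.
function containsAll(a: Map<string, number>, b: Map<string, number>): boolean {
  let ok = true;
  b.forEach((count, name) => {
    if ((a.get(name) ?? 0) < count) ok = false;
  });
  return ok;
}

// Simplified, name-only interpretation of the match modes (an assumption of this sketch).
function matchToolNames(
  output: string[],
  reference: string[],
  mode: MatchMode
): boolean {
  const out = countNames(output);
  const ref = countNames(reference);
  switch (mode) {
    case "strict": // same calls in the same order
      return (
        output.length === reference.length &&
        output.every((name, i) => name === reference[i])
      );
    case "unordered": // same calls in any order
      return containsAll(out, ref) && containsAll(ref, out);
    case "subset": // the agent made no calls beyond the reference
      return containsAll(ref, out);
    case "superset": // the agent made at least the reference calls
      return containsAll(out, ref);
  }
}
```

With the example trajectories above, the output calls get_weather and get_directions while the reference calls only get_weather, so this sketch accepts the output under superset and rejects it under strict, unordered, and subset.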

As a next step, learn more about how to customize the trajectory match evaluator.

LLM-as-a-judge

You can also use an LLM-as-a-judge evaluator, which uses an LLM to compare the trajectory against the reference outputs and output a score:

import {
  createTrajectoryLlmAsJudge,
  TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE,
} from "agentevals/trajectory/llm";

const evaluator = createTrajectoryLlmAsJudge({
  prompt: TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE,
  model: "openai:o3-mini",
});

Run evaluator

To run an evaluator, you will first need to create a LangSmith dataset. To use the prebuilt AgentEvals evaluators, you will need a dataset with the following schema:

input: {"messages": [...]}, the input messages to call the agent with.
output: {"messages": [...]}, the expected message history in the agent output. For trajectory evaluation, you can choose to keep only assistant messages.
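
As a rough illustration of this schema, the sketch below builds one dataset example from a recorded conversation and keeps only the assistant messages in the expected output. The `Message` and `DatasetExample` types and the `buildExample` helper are hypothetical. The plural `inputs` and `outputs` field names follow the convention used elsewhere in this guide and are an assumption here:

```typescript
// Hypothetical, simplified message shape.
type Message = { role: string; content?: string; tool_calls?: unknown[] };

// One dataset example: input messages plus the expected message history.
type DatasetExample = {
  inputs: { messages: Message[] };
  outputs: { messages: Message[] };
};

// Build an example, keeping only assistant messages in the expected output,
// which is sufficient for trajectory evaluation.
function buildExample(
  inputMessages: Message[],
  fullHistory: Message[]
): DatasetExample {
  return {
    inputs: { messages: inputMessages },
    outputs: {
      messages: fullHistory.filter((message) => message.role === "assistant"),
    },
  };
}

const example = buildExample(
  [{ role: "user", content: "What is the weather in san francisco?" }],
  [
    { role: "user", content: "What is the weather in san francisco?" },
    {
      role: "assistant",
      tool_calls: [
        {
          function: {
            name: "get_weather",
            arguments: JSON.stringify({ city: "san francisco" }),
          },
        },
      ],
    },
    { role: "tool", content: "It is sunny." },
  ]
);
```

Examples shaped like this can then be uploaded to a LangSmith dataset with the client of your choice.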

import { evaluate } from "langsmith/evaluation";
import { createReactAgent } from "@langchain/langgraph/prebuilt";
import { createTrajectoryMatchEvaluator } from "agentevals/trajectory/match";

const agent = createReactAgent({...});
const evaluator = createTrajectoryMatchEvaluator({...});

const experimentResults = await evaluate(
  (inputs) => agent.invoke(inputs),
  {
    // replace with your dataset name
    data: "<Name of your dataset>",
    evaluators: [evaluator],
  }
);
