Code evaluators are functions that take a dataset example and the resulting application output, and return one or more metrics. These functions can be passed directly into the evaluate() or aevaluate() functions.
Basic example
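A minimal sketch of a code evaluator: the names answer_question and "my-dataset" below are placeholders, and the dataset is assumed to have an "answer" field in both the outputs and reference outputs. The evaluator checks the application output against the reference output for an exact match.

```python
from langsmith import evaluate

def answer_question(inputs: dict) -> dict:
    # Hypothetical application under test; replace with your own target function.
    return {"answer": "Paris"}

def correct(outputs: dict, reference_outputs: dict) -> bool:
    # The returned bool is logged as a metric named after the function ("correct").
    return outputs["answer"] == reference_outputs["answer"]

results = evaluate(
    answer_question,
    data="my-dataset",       # assumed dataset name
    evaluators=[correct],
)
```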
Evaluator args
Code evaluator functions must have specific argument names. They can take any subset of the following arguments:
- run: Run: The full Run object generated by the application on the given example.
- example: Example: The full dataset Example, including the example inputs, outputs (if available), and metadata (if available).
- inputs: dict: A dictionary of the inputs corresponding to a single example in a dataset.
- outputs: dict: A dictionary of the outputs generated by the application on the given inputs.
- reference_outputs / referenceOutputs: dict: A dictionary of the reference outputs associated with the example, if available.
For most use cases you'll only need inputs, outputs, and reference_outputs. run and example are useful only if you need some extra trace or example metadata outside of the actual inputs and outputs of the application.
When using JS/TS these should all be passed in as part of a single object argument.
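As a sketch of using the richer arguments, the evaluator below accepts the full Run and Example objects and weights an exact-match score by a hypothetical "difficulty" value assumed to live in the example's metadata:

```python
from langsmith.schemas import Example, Run

def weighted_correct(run: Run, example: Example, outputs: dict, reference_outputs: dict) -> dict:
    # `run` exposes the full trace (inputs, outputs, timings) if you need it;
    # here we only use the example metadata.
    base = 1.0 if outputs.get("answer") == reference_outputs.get("answer") else 0.0
    difficulty = (example.metadata or {}).get("difficulty", 1)  # hypothetical metadata field
    return {"key": "weighted_correct", "score": base * difficulty}
```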
Evaluator output
Code evaluators are expected to return one of the following types (Python and JS/TS):
- dict: dicts of the form {"score" | "value": ..., "key": ...} allow you to customize the metric type ("score" for numerical and "value" for categorical) and the metric name. This is useful if, for example, you want to log an integer as a categorical metric.
- int | float | bool: this is interpreted as a continuous metric that can be averaged, sorted, etc. The function name is used as the name of the metric.
- str: this is interpreted as a categorical metric. The function name is used as the name of the metric.
- list[dict]: return multiple metrics using a single function.
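A sketch of the dict and list[dict] return shapes, assuming the outputs and reference outputs contain an "answer" string:

```python
def verbosity(outputs: dict) -> dict:
    # Log an integer as a *categorical* metric by using "value" instead of "score".
    word_count = len(outputs["answer"].split())
    return {"key": "verbosity", "value": word_count}

def multiple_metrics(outputs: dict, reference_outputs: dict) -> list[dict]:
    # Return several metrics from a single evaluator.
    return [
        {"key": "correct", "score": float(outputs["answer"] == reference_outputs["answer"])},
        {"key": "answer_length", "score": len(outputs["answer"])},
    ]
```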
Additional examples
Requires langsmith>=0.2.0
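Evaluators can also be async and run with aevaluate(). A sketch, again assuming a dataset with an "answer" field and a hypothetical async target function:

```python
import asyncio

from langsmith import aevaluate

async def answer_question(inputs: dict) -> dict:
    # Hypothetical async application under test.
    return {"answer": "Paris"}

async def correct(outputs: dict, reference_outputs: dict) -> bool:
    return outputs["answer"] == reference_outputs["answer"]

async def main():
    await aevaluate(
        answer_question,
        data="my-dataset",   # assumed dataset name
        evaluators=[correct],
    )

asyncio.run(main())
```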
Related
- Evaluate aggregate experiment results: Define summary evaluators, which compute metrics for an entire experiment.
- Run an evaluation comparing two experiments: Define pairwise evaluators, which compute metrics by comparing two (or more) experiments against each other.

