Basic example
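A minimal sketch of a custom evaluator wired into the SDK's `evaluate` flow, assuming the langsmith Python SDK, a dataset named "dataset_name", and outputs/reference outputs that both contain an "answer" key (the dataset name, target function, and keys are illustrative):

```python
from langsmith import Client

def correct(outputs: dict, reference_outputs: dict) -> bool:
    """Exact-match comparison between the app's answer and the reference answer."""
    return outputs["answer"] == reference_outputs["answer"]

def dummy_app(inputs: dict) -> dict:
    # Stand-in for your real application; it must return a dict of outputs.
    return {"answer": "42"}

client = Client()
results = client.evaluate(
    dummy_app,
    data="dataset_name",  # hypothetical dataset name
    evaluators=[correct],
)
```

Because the function returns a `bool` and is named `correct`, each example in the experiment gets a boolean metric named "correct".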
Evaluator args
Code evaluator functions must have specific argument names. They can take any subset of the following arguments:
- `run: Run`: The full Run object generated by the application on the given example.
- `example: Example`: The full dataset Example, including the example inputs, outputs (if available), and metadata (if available).
- `inputs: dict`: A dictionary of the inputs corresponding to a single example in a dataset.
- `outputs: dict`: A dictionary of the outputs generated by the application on the given inputs.
- `reference_outputs` / `referenceOutputs: dict`: A dictionary of the reference outputs associated with the example, if available.
Most evaluators only need `inputs`, `outputs`, and `reference_outputs`. `run` and `example` are useful only if you need some extra trace or example metadata outside of the actual inputs and outputs of the application, as in the sketch below.
When using JS/TS these should all be passed in as part of a single object argument.
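For instance, an evaluator that grades on trace metadata rather than the application's inputs and outputs could look like this (a sketch; the 5-second latency threshold is arbitrary and the attribute access assumes the SDK's `Run` schema):

```python
from langsmith.schemas import Example, Run

def fast_enough(run: Run, example: Example) -> bool:
    """Pass if the traced run completed in under 5 seconds."""
    if run.end_time is None:  # run may still be in flight or have failed
        return False
    return (run.end_time - run.start_time).total_seconds() < 5.0
```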
Evaluator output
Code evaluators are expected to return one of the following types (Python and JS/TS):
- `dict`: dicts of the form `{"score" | "value": ..., "key": ...}` allow you to customize the metric type ("score" for numerical and "value" for categorical) and metric name. This is useful if, for example, you want to log an integer as a categorical metric.
- `int | float | bool`: this is interpreted as a continuous metric that can be averaged, sorted, etc. The function name is used as the name of the metric.
- `str`: this is interpreted as a categorical metric. The function name is used as the name of the metric.
- `list[dict]`: return multiple metrics using a single function, as in the sketch below.
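For example, a single evaluator can log one numerical and one categorical metric at once (the metric keys and the "answer" output field below are illustrative):

```python
def multiple_metrics(outputs: dict, reference_outputs: dict) -> list[dict]:
    """Return several metrics from a single evaluator function."""
    answer = outputs["answer"]
    return [
        # "score" records a numerical metric under the given key.
        {"key": "exact_match", "score": answer == reference_outputs["answer"]},
        # "value" records a categorical metric, e.g. an integer logged as a category.
        {"key": "answer_word_count", "value": len(answer.split())},
    ]
```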
Additional examples
Requires `langsmith>=0.2.0`
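As one more illustration, a custom evaluator can wrap an LLM judge directly. This sketch uses the OpenAI client; the model name, prompt, and the "question"/"answer" keys are assumptions, not part of the LangSmith API:

```python
from openai import OpenAI

oai_client = OpenAI()

def concise(inputs: dict, outputs: dict) -> bool:
    """Ask an LLM judge whether the answer is concise (hypothetical prompt and keys)."""
    response = oai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": (
                    "Is the following answer concise? Reply with only 'Y' or 'N'.\n\n"
                    f"Question: {inputs['question']}\nAnswer: {outputs['answer']}"
                ),
            }
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("Y")
```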
Related
- Evaluate aggregate experiment results: Define summary evaluators, which compute metrics for an entire experiment.
- Run an evaluation comparing two experiments: Define pairwise evaluators, which compute metrics by comparing two (or more) experiments against each other.