In this guide, you will use `evaluate()` with two existing experiments to define an evaluator and run a pairwise evaluation. Finally, you'll use the LangSmith UI to view the pairwise experiments.
Pairwise evaluations require `langsmith` Python version `>=0.2.0` or JS version `>=0.2.9`.
It is not supported to use `evaluate_comparative()` with more than two existing experiments.

## `evaluate()` comparative args

The `evaluate` / `aevaluate` function takes the following arguments:
Argument | Description |
---|---|
target | A list of the two existing experiments you would like to evaluate against each other. These can be UUIDs or experiment names. |
evaluators | A list of the pairwise evaluators that you would like to attach to this evaluation. See the section below for how to define these. |
You can also pass the following optional arguments:

Argument | Description |
---|---|
randomize_order / randomizeOrder | An optional boolean indicating whether the order of the outputs should be randomized for each evaluation. This is a strategy for minimizing positional bias in your prompt: often, the LLM will be biased towards one of the responses based on the order. This should mainly be addressed via prompt engineering, but this is another optional mitigation. Defaults to False. |
experiment_prefix / experimentPrefix | A prefix to be attached to the beginning of the pairwise experiment name. Defaults to None. |
description | A description of the pairwise experiment. Defaults to None. |
max_concurrency / maxConcurrency | The maximum number of concurrent evaluations to run. Defaults to 5. |
client | The LangSmith client to use. Defaults to None. |
metadata | Metadata to attach to your pairwise experiment. Defaults to None. |
load_nested / loadNested | Whether to load all child runs for the experiment. When False, only the root trace will be passed to your evaluator. Defaults to False. |
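Putting these together, a call might look like the following sketch. The experiment names are placeholders, and `ranked_preference` stands in for a pairwise evaluator like the ones sketched later in this section:

```python
from langsmith import evaluate

# Placeholder experiment names -- substitute the names or UUIDs of two
# experiments that were run against the same dataset.
results = evaluate(
    ("my-experiment-A", "my-experiment-B"),  # target: the two experiments to compare
    evaluators=[ranked_preference],          # a pairwise evaluator (sketched below)
    randomize_order=True,                    # shuffle output order to reduce positional bias
    experiment_prefix="pairwise-demo",
    max_concurrency=4,
)
```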
Pairwise evaluators are functions that take the following arguments:

- `inputs: dict`: A dictionary of the inputs corresponding to a single example in a dataset.
- `outputs: list[dict]`: A two-item list of the dict outputs produced by each experiment on the given inputs.
- `reference_outputs` / `referenceOutputs: dict`: A dictionary of the reference outputs associated with the example, if available.
- `runs: list[Run]`: A two-item list of the full Run objects generated by the two experiments on the given example. Use this if you need access to intermediate steps or metadata about each run.
- `example: Example`: The full dataset Example, including the example inputs, outputs (if available), and metadata (if available).

For most use cases you will only need `inputs`, `outputs`, and `reference_outputs` / `referenceOutputs`. `runs` and `example` are useful only if you need some extra trace or example metadata outside of the actual inputs and outputs of the application.
Pairwise evaluators can return one of the following types:

- `dict`: a dictionary with keys:
  - `key`, which represents the feedback key that will be logged
  - `scores`, which is a mapping from run ID to score for that run
  - `comment`, which is a string. Most commonly used for model reasoning.
- `list[int | float | bool]`: a two-item list of scores. The list is assumed to have the same order as the `runs` / `outputs` evaluator args. The evaluator function name is used for the feedback key.

It is common to prefix pairwise feedback keys with `pairwise_` or `ranked_`.
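For the dict return type, the scores must be keyed by run ID, so the evaluator needs access to `runs`. The following is a sketch under that assumption: the preference logic is a placeholder where an LLM judge's comparison would normally go, and the function name `pairwise_preference` is purely illustrative.

```python
from langsmith.schemas import Run


def pairwise_preference(runs: list[Run], outputs: list[dict]) -> dict:
    """Return the dict form: feedback key, per-run scores, and a comment."""
    # Placeholder judgment: prefer the first output if it is non-empty.
    # In practice, this is where you would prompt an LLM judge with both outputs.
    first_wins = bool(outputs[0])
    return {
        "key": "pairwise_preference",  # commonly prefixed with "pairwise_" or "ranked_"
        "scores": {
            str(runs[0].id): 1 if first_wins else 0,
            str(runs[1].id): 0 if first_wins else 1,
        },
        "comment": "Placeholder reasoning; an LLM judge's explanation would usually go here.",
    }
```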