- `key`: The name of the metric.
- `score` | `value`: The value of the metric. Use `score` if it is a numerical metric and `value` if it is categorical.
- `comment` (optional): The reasoning or additional string information justifying the score.
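For example, a custom Python evaluator might return a dictionary of this shape. This is a minimal sketch: the `correctness` metric name and the `answer` fields in `run.outputs` / `example.outputs` are placeholders for whatever your application and dataset actually produce.

```python
from langsmith.schemas import Example, Run

def correctness(run: Run, example: Example) -> dict:
    # Compare the experiment output to the reference output (field names are placeholders).
    matches = run.outputs.get("answer") == example.outputs.get("answer")
    return {
        "key": "correctness",  # name of the metric
        "score": int(matches),  # numerical metric, so "score" is used
        "comment": "Exact match against the reference answer.",  # optional justification
    }
```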
You can run an experiment multiple times by passing the `num_repetitions` argument to `evaluate` / `aevaluate` (Python, TypeScript). Repeating the experiment involves both re-running the target function to generate outputs and re-running the evaluators.
To learn more about running repetitions on experiments, read the how-to guide.
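As a rough sketch, passing `num_repetitions` to `evaluate` in recent versions of the langsmith Python SDK might look like the following. The dataset name, target function, and evaluator are placeholders.

```python
from langsmith import evaluate

def target(inputs: dict) -> dict:
    # Placeholder application logic; replace with a call to your app.
    return {"answer": inputs["question"]}

def exact_match(run, example) -> dict:
    return {"key": "exact_match", "score": run.outputs == example.outputs}

results = evaluate(
    target,
    data="my-dataset",         # placeholder dataset name
    evaluators=[exact_match],
    num_repetitions=3,         # re-run the target and evaluators 3 times over the dataset
)
```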
By passing the `max_concurrency` argument to `evaluate` / `aevaluate`, you can specify the concurrency of your experiment. The `max_concurrency` argument has slightly different semantics depending on whether you are using `evaluate` or `aevaluate`.
`evaluate`

The `max_concurrency` argument to `evaluate` specifies the maximum number of concurrent threads to use when running the experiment. This applies both to running your target function and to running your evaluators.
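A minimal sketch, reusing the placeholder target function and evaluator from the snippet above:

```python
from langsmith import evaluate

results = evaluate(
    target,                    # placeholder target function from the earlier sketch
    data="my-dataset",         # placeholder dataset name
    evaluators=[exact_match],
    max_concurrency=4,         # at most 4 threads shared by target calls and evaluators
)
```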
`aevaluate`

The `max_concurrency` argument to `aevaluate` is fairly similar to `evaluate`, but it instead uses a semaphore to limit the number of concurrent tasks that can run at once. `aevaluate` works by creating a task for each example in the dataset. Each task consists of running the target function as well as all of the evaluators on that specific example. The `max_concurrency` argument specifies the maximum number of concurrent tasks (in other words, examples) to run at once.
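A sketch of the async variant, again with placeholder names:

```python
import asyncio
from langsmith import aevaluate

async def async_target(inputs: dict) -> dict:
    # Placeholder async application logic.
    return {"answer": inputs["question"]}

def exact_match(run, example) -> dict:
    return {"key": "exact_match", "score": run.outputs == example.outputs}

async def main():
    return await aevaluate(
        async_target,
        data="my-dataset",        # placeholder dataset name
        evaluators=[exact_match],
        max_concurrency=4,        # at most 4 examples (tasks) in flight at once
    )

results = asyncio.run(main())
```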
You can also cache the API calls made in your experiment by setting `LANGSMITH_TEST_CACHE` to a valid folder on your device with write access. This causes the API calls to be cached to disk, meaning future experiments that make the same API calls will be greatly sped up.
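For instance, before running the experiment you might point the cache at a writable folder (the path here is just an example):

```python
import os

# Any writable directory works; cached API calls are written here and
# replayed on later runs that make the same calls.
os.environ["LANGSMITH_TEST_CACHE"] = "tests/cassettes"
```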
It can be convenient to write and run evaluations using a testing framework such as `pytest` or `vitest/jest`. LangSmith provides integrations with `pytest` (Python) and `Vitest/Jest` (TypeScript). These make it easy to: