- CI/CD pipelines: Implement quality gates that fail builds if evaluation scores drop below a threshold.
- Local debugging: Inspect and analyze results without API calls.
- Custom aggregations: Calculate metrics and statistics using your own logic.
- Integration testing: Use evaluation results to gate merges or deployments.
All of these workflows operate on the ExperimentResults object returned by Client.evaluate().
This page focuses on processing results programmatically while still uploading them to LangSmith. If you want to run evaluations locally without recording anything to LangSmith (for quick testing or validation), refer to Run an evaluation locally, which uses upload_results=False.

Iterate over evaluation results
The evaluate() function returns an ExperimentResults object that you can iterate over. The blocking parameter controls when results become available:
- blocking=False: Returns immediately with an iterator that yields results as they're produced. This allows you to process results in real time as the evaluation runs.
- blocking=True (default): Blocks until all evaluations complete before returning. When you iterate over the results, all data is already available.
In both cases you get the same ExperimentResults type; the difference is whether the function waits for completion before returning. Use blocking=False for streaming and real-time debugging, or blocking=True for batch processing when you need the complete dataset.
The following example demonstrates blocking=False. It iterates over results as they stream in, collects them in a list, then processes them in a separate loop:
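Below is a minimal sketch of this pattern. It assumes a dataset named my-dataset exists in your workspace, and the target function and accuracy evaluator are placeholders for your own application and metric (adapt them to your evaluator signature and SDK version):

```python
from langsmith import Client

client = Client()

def target(inputs: dict) -> dict:
    # Placeholder for your application logic.
    return {"answer": "Paris"}

def accuracy(run, example) -> dict:
    # Placeholder evaluator: compares the target's output to the reference output.
    predicted = (run.outputs or {}).get("answer")
    expected = (example.outputs or {}).get("answer")
    return {"key": "accuracy", "score": int(predicted == expected)}

# blocking=False returns immediately; results stream in as each example finishes.
results = client.evaluate(
    target,
    data="my-dataset",        # assumed dataset name
    evaluators=[accuracy],
    blocking=False,
)

# First loop: consume results as they arrive and collect them.
collected = []
for result in results:
    print(f"Completed run {result['run'].id}")
    collected.append(result)

# Second loop: process the collected results once streaming is done.
for result in collected:
    for eval_result in result["evaluation_results"]["results"]:
        print(f"{eval_result.key}: {eval_result.score}")
```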
Understand the result structure
Each result in the iterator contains:

- result["run"]: The execution of your target function.
  - result["run"].inputs: The inputs from your dataset example.
  - result["run"].outputs: The outputs produced by your target function.
  - result["run"].id: The unique ID for this run.
- result["evaluation_results"]["results"]: A list of EvaluationResult objects, one per evaluator.
  - key: The metric name (from your evaluator's return value).
  - score: The numeric score (typically 0-1 or boolean).
  - comment: Optional explanatory text.
  - source_run_id: The ID of the evaluator run.
- result["example"]: The dataset example that was evaluated.
  - result["example"].inputs: The input values.
  - result["example"].outputs: The reference outputs (if any).
Examples
Implement a quality gate
This example uses evaluation results to pass or fail a CI/CD build automatically based on quality thresholds. The script iterates through results, calculates an average accuracy score, and exits with a non-zero status code if the accuracy falls below 85%. This ensures that only code changes that meet quality standards are deployed.
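A sketch of such a gate, assuming the same my-dataset dataset and placeholder target function and accuracy evaluator as above; adjust the metric key and the 85% threshold to your setup:

```python
import sys

from langsmith import Client

client = Client()

def target(inputs: dict) -> dict:
    # Placeholder for your application logic.
    return {"answer": "Paris"}

def accuracy(run, example) -> dict:
    # Placeholder evaluator; emits a metric keyed "accuracy".
    predicted = (run.outputs or {}).get("answer")
    expected = (example.outputs or {}).get("answer")
    return {"key": "accuracy", "score": int(predicted == expected)}

results = client.evaluate(target, data="my-dataset", evaluators=[accuracy])

# Collect every "accuracy" score across all runs.
scores = [
    eval_result.score
    for result in results
    for eval_result in result["evaluation_results"]["results"]
    if eval_result.key == "accuracy" and eval_result.score is not None
]

average = sum(scores) / len(scores) if scores else 0.0
print(f"Average accuracy: {average:.2%}")

# Exit non-zero so the CI/CD job fails when quality drops below the threshold.
if average < 0.85:
    print("Quality gate failed: average accuracy below 85%")
    sys.exit(1)
```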
Batch processing with blocking=True

When you need to perform operations that require the complete dataset (like calculating percentiles, sorting by score, or generating summary reports), use blocking=True to wait for all evaluations to complete before processing:
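For example, the following sketch (same assumed dataset and placeholders as above) waits for all results, then computes summary statistics and sorts by score to surface the weakest examples:

```python
import statistics

from langsmith import Client

client = Client()

def target(inputs: dict) -> dict:
    # Placeholder for your application logic.
    return {"answer": "Paris"}

def accuracy(run, example) -> dict:
    # Placeholder evaluator; emits a metric keyed "accuracy".
    predicted = (run.outputs or {}).get("answer")
    expected = (example.outputs or {}).get("answer")
    return {"key": "accuracy", "score": int(predicted == expected)}

# blocking=True (the default) waits until every evaluation has finished.
results = list(
    client.evaluate(target, data="my-dataset", evaluators=[accuracy], blocking=True)
)

def min_score(result: dict) -> float:
    # Lowest evaluator score for one run; used to sort for the worst examples.
    scores = [e.score for e in result["evaluation_results"]["results"] if e.score is not None]
    return min(scores) if scores else 0.0

all_scores = [
    e.score
    for result in results
    for e in result["evaluation_results"]["results"]
    if e.key == "accuracy" and e.score is not None
]

# Whole-dataset operations are safe here because every result is available.
if all_scores:
    print(f"Examples evaluated: {len(results)}")
    print(f"Mean accuracy:      {statistics.mean(all_scores):.3f}")
    print(f"Median accuracy:    {statistics.median(all_scores):.3f}")

# Sort by score to surface the five worst-performing examples.
for result in sorted(results, key=min_score)[:5]:
    print("Low-scoring example:", result["example"].inputs)
```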
With blocking=True, your processing code runs only after all evaluations are complete, so its output is not interleaved with evaluation logs.
For more information on running evaluations without uploading results, refer to Run an evaluation locally.
Related
- Evaluate your LLM application
- Run an evaluation locally
- Fetch performance metrics from an experiment