The examples in this guide use Python and the requests library, but the same principles apply to any language.
Before diving into this content, it might be helpful to read the following:
- Evaluate LLM applications.
- LangSmith API Reference: Complete API documentation for all endpoints used in this guide.
Create a dataset
For this example, we use the Python SDK to create a dataset quickly. To create datasets via the API or UI instead, refer to Managing datasets.
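A minimal sketch of that step with the Python SDK is shown below; the dataset name, example texts, and Toxic / Not toxic labels are illustrative placeholders (they just need to match what the evaluation later in this guide expects).

```python
from langsmith import Client

client = Client()  # reads LANGSMITH_API_KEY from the environment

# Hypothetical dataset: short texts labeled as Toxic or Not toxic.
dataset = client.create_dataset(
    dataset_name="Toxic Queries",
    description="Texts labeled Toxic or Not toxic.",
)

examples = [
    ("Shut up, idiot", "Toxic"),
    ("You're a wonderful person", "Not toxic"),
    ("This is the worst thing I've ever seen", "Not toxic"),
]

client.create_examples(
    inputs=[{"text": text} for text, _ in examples],
    outputs=[{"label": label} for _, label in examples],
    dataset_id=dataset.id,
)
```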
Run a single experiment
To run an experiment via the API, you’ll need to:
- Fetch the examples from your dataset.
- Create an experiment (also called a “session” in the API).
- For each example, create runs that reference both the example and the experiment.
- Close the experiment by setting its end_time.
The code below demonstrates:
- Fetching the dataset’s examples via GET to the /examples endpoint.
- Creating run objects via POST to /runs with reference_example_id and session_id set.
- Tracking parent-child relationships between runs (e.g., a parent “chain” run containing a child “llm” run).
- Updating runs with outputs via PATCH to /runs/{run_id}.
The experiment (session) is created with a reference_dataset_id that links it to the dataset. The key difference from regular tracing is that runs in an experiment must have a reference_example_id that links each run to a specific example in the dataset.
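A minimal end-to-end sketch of that flow with the requests library is shown below. It assumes the hosted API at https://api.smith.langchain.com, an API key in LANGSMITH_API_KEY, and the dataset created earlier; the run names, placeholder model output, and the trace_id/dotted_order handling are illustrative, so treat the LangSmith API Reference as the source of truth for the exact request schemas.

```python
import os
import uuid
from datetime import datetime, timezone

import requests

API_URL = "https://api.smith.langchain.com/api/v1"
HEADERS = {"x-api-key": os.environ["LANGSMITH_API_KEY"]}
dataset_id = "<your-dataset-id>"  # the dataset created above


def now() -> datetime:
    return datetime.now(timezone.utc)


def dotted_order(run_id: str, start: datetime, parent: str | None = None) -> str:
    # Run ingestion expects a dotted_order: segments of "<start_time><run_id>",
    # with child segments appended after a "." (assumed format; see API Reference).
    segment = start.strftime("%Y%m%dT%H%M%S%fZ") + run_id
    return f"{parent}.{segment}" if parent else segment


# 1. Fetch the examples from the dataset.
examples = requests.get(
    f"{API_URL}/examples", params={"dataset": dataset_id}, headers=HEADERS
).json()

# 2. Create the experiment: a session that references the dataset.
experiment = requests.post(
    f"{API_URL}/sessions",
    headers=HEADERS,
    json={
        "name": f"my-experiment-{uuid.uuid4().hex[:8]}",
        "reference_dataset_id": dataset_id,
        "start_time": now().isoformat(),
    },
).json()

# 3. For each example, create a parent "chain" run containing a child "llm" run,
#    both in the experiment; the parent references the example it was run against.
for example in examples:
    parent_id, child_id = str(uuid.uuid4()), str(uuid.uuid4())
    parent_start, child_start = now(), now()
    parent_order = dotted_order(parent_id, parent_start)
    requests.post(
        f"{API_URL}/runs",
        headers=HEADERS,
        json={
            "id": parent_id,
            "trace_id": parent_id,
            "dotted_order": parent_order,
            "name": "toxicity_classifier",
            "run_type": "chain",
            "inputs": example["inputs"],
            "start_time": parent_start.isoformat(),
            "session_id": experiment["id"],
            "reference_example_id": example["id"],  # ties the run to the example
        },
    )
    requests.post(
        f"{API_URL}/runs",
        headers=HEADERS,
        json={
            "id": child_id,
            "trace_id": parent_id,
            "dotted_order": dotted_order(child_id, child_start, parent_order),
            "parent_run_id": parent_id,  # nests the llm run under the chain run
            "name": "llm_call",
            "run_type": "llm",
            "inputs": example["inputs"],
            "start_time": child_start.isoformat(),
            "session_id": experiment["id"],
        },
    )

    outputs = {"label": "Toxic"}  # placeholder: call your model here

    # 4. Update both runs with outputs and end times.
    for run_id in (child_id, parent_id):
        requests.patch(
            f"{API_URL}/runs/{run_id}",
            headers=HEADERS,
            json={"outputs": outputs, "end_time": now().isoformat()},
        )

# 5. Close the experiment by setting its end_time.
requests.patch(
    f"{API_URL}/sessions/{experiment['id']}",
    headers=HEADERS,
    json={"end_time": now().isoformat()},
)
```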
Add evaluation feedback
After running your experiments, you’ll typically want to evaluate the results by adding feedback scores. This allows you to track metrics like correctness, accuracy, or any custom evaluation criteria. In this example, the evaluation checks if each model’s output matches the expected label in the dataset. The code posts a “correctness” score (1.0 for correct, 0.0 for incorrect) to track how accurately each model classifies toxic vs. non-toxic text. The following code adds feedback to the runs from the single experiment example:
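In the sketch below, the run IDs, model outputs, and expected labels are assumed to have been collected while creating the runs above (the runs list is a hypothetical shape); the POST /feedback body follows the correctness scoring described here, but check the API Reference for the full schema.

```python
import os

import requests

API_URL = "https://api.smith.langchain.com/api/v1"
HEADERS = {"x-api-key": os.environ["LANGSMITH_API_KEY"]}

# Hypothetical bookkeeping: one entry per run created in the experiment above.
runs = [
    {"run_id": "<run-uuid>", "output_label": "Toxic", "expected_label": "Toxic"},
]

for run in runs:
    correct = run["output_label"] == run["expected_label"]
    requests.post(
        f"{API_URL}/feedback",
        headers=HEADERS,
        json={
            "run_id": run["run_id"],
            "key": "correctness",              # metric name shown in LangSmith
            "score": 1.0 if correct else 0.0,  # 1.0 = correct, 0.0 = incorrect
        },
    )
```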
Run a pairwise experiment
Next, we’ll demonstrate how to run a pairwise experiment. In a pairwise experiment, you compare two examples against each other. For more information, check out this guide.
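As a rough sketch using only the endpoints from earlier in this guide, you can run the single-experiment flow twice over the same dataset (for example, once per model) and record a per-example preference as feedback on the preferred run. The preference key, the results_a/results_b structures, and the comparison rule below are all illustrative; the linked guide describes LangSmith’s full pairwise-experiment support.

```python
import os

import requests

API_URL = "https://api.smith.langchain.com/api/v1"
HEADERS = {"x-api-key": os.environ["LANGSMITH_API_KEY"]}

# Hypothetical results collected while running the experiment flow twice:
# example_id -> {"run_id": ..., "label": ...} for each of the two experiments.
results_a: dict[str, dict] = {}
results_b: dict[str, dict] = {}
expected_labels: dict[str, str] = {}  # example_id -> expected label from the dataset

for example_id, run_a in results_a.items():
    run_b = results_b[example_id]
    # Illustrative preference rule: prefer the run that matched the expected label.
    preferred = run_a if run_a["label"] == expected_labels[example_id] else run_b
    requests.post(
        f"{API_URL}/feedback",
        headers=HEADERS,
        json={"run_id": preferred["run_id"], "key": "preference", "score": 1.0},
    )
```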