This guide shows you how to run an evaluation using OpenTelemetry tracing with LangSmith.
If you’re already using OpenTelemetry for tracing your LLM application, you can run evaluations by routing traces to an experiment session. This approach is useful when you want to evaluate applications that are instrumented with OpenTelemetry but don’t use the LangSmith SDK’s evaluate() function.
Overview
When evaluating with OpenTelemetry, you need to:
- Create an experiment session in LangSmith.
- Configure OpenTelemetry to send traces to LangSmith.
- Add specific span attributes to link traces to the experiment and dataset examples.
- Run your application for each example in the dataset.
Prerequisites
This guide assumes you have:
- An application instrumented with OpenTelemetry that sends traces to LangSmith.
- A dataset created in LangSmith with examples to evaluate. You can create a dataset via the LangSmith UI or via the SDK.
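For example, here is a minimal sketch of creating a small dataset with the SDK, assuming the dependencies and environment variables described below are already configured. The dataset name and example contents are placeholders, and the exact create_examples keywords can vary slightly between SDK versions:
from langsmith import Client

client = Client()

# Placeholder dataset name and example; replace with your own data.
dataset = client.create_dataset(dataset_name="strands-agent-eval")
client.create_examples(
    inputs=[{"input": "Write a Python function that reverses a string."}],
    outputs=[{"output": "def reverse(s):\n    return s[::-1]"}],
    dataset_id=dataset.id,
)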
This tutorial uses Strands agents as example implementations, but the approach works with any application instrumented with OpenTelemetry.
Install dependencies:
pip install langsmith strands-agents strands-agents-tools opentelemetry-sdk opentelemetry-exporter-otlp
Set the following environment variables:
# Tracing configuration
LANGSMITH_ENDPOINT="https://api.smith.langchain.com"
LANGSMITH_API_KEY="<your-langsmith-api-key>"
OTEL_EXPORTER_OTLP_ENDPOINT="https://api.smith.langchain.com/otel/"
# AWS Credentials
AWS_ACCESS_KEY_ID="<your-aws-access-key-id>"
AWS_SECRET_ACCESS_KEY="<your-aws-secret-access-key>"
AWS_REGION_NAME="<your-aws-region>"
If you’re self-hosting LangSmith, replace OTEL_EXPORTER_OTLP_ENDPOINT with your self-hosted URL and append /api/v1/otel, for example OTEL_EXPORTER_OTLP_ENDPOINT="https://ai-company.com/api/v1/otel". Also replace LANGSMITH_ENDPOINT with your LangSmith API endpoint, for example LANGSMITH_ENDPOINT="https://ai-company.com/api/v1".
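If you prefer to set this configuration in code (for example, in a notebook), here is a minimal sketch using os.environ; the values are the same placeholders as above:
import os

# Same configuration as the shell variables above; substitute your own values.
os.environ["LANGSMITH_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGSMITH_API_KEY"] = "<your-langsmith-api-key>"
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://api.smith.langchain.com/otel/"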
Step 1. Create an experiment session
An experiment session groups all evaluation traces together. Create one using the LangSmith client:
from langsmith import Client
# Initialize LangSmith client
client = Client()
experiment_name = "strands-agent-experiment"
# Assumes a dataset has been created. You can find the dataset ID in the LangSmith UI or via the SDK.
dataset_id = "<your-dataset-id>"
# Create an experiment session linked to the dataset
project = client.create_project(
    project_name=experiment_name,
    reference_dataset_id=dataset_id
)
experiment_id = str(project.id)
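If you only have the dataset name, here is a minimal sketch of looking up its ID with the SDK (the dataset name is a placeholder):
# Look up the dataset ID by name instead of copying it from the UI.
dataset = client.read_dataset(dataset_name="strands-agent-eval")
dataset_id = str(dataset.id)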
Additionally, you can create evaluators in the LangSmith UI and bind them to your dataset. Evaluators bound to a dataset automatically run on its experiment traces.
To learn more about evaluators, see Evaluators.
Step 2. Configure OpenTelemetry to send traces to the experiment
Next, configure your OpenTelemetry-instrumented application to route traces to the experiment session by including the experiment ID in the OTEL_EXPORTER_OTLP_HEADERS value. This example uses a Strands agent, but you can use any application instrumented with OpenTelemetry.
TypeScript examples are not provided for this step as the Strands TypeScript SDK does not currently support OpenTelemetry observability (as of February 2026).
import os
from strands import Agent
from strands_tools import file_read, file_write, python_repl, shell, journal
from strands.telemetry import StrandsTelemetry
# Set OTEL headers with experiment ID as the project
api_key = os.getenv('LANGSMITH_API_KEY')
os.environ['OTEL_EXPORTER_OTLP_HEADERS'] = f"x-api-key={api_key},Langsmith-Project={experiment_id}"
# Initialize telemetry
strands_telemetry = StrandsTelemetry()
strands_telemetry.setup_otlp_exporter()
# Create an agent (Strands automatically creates OTel spans)
agent = Agent(
    tools=[file_read, file_write, python_repl, shell, journal],
    system_prompt="You are an Expert Software Developer.",
    model="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
)
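If your application is not a Strands agent, here is a minimal sketch of a generic OpenTelemetry setup that exports to LangSmith, assuming the OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_HEADERS environment variables shown earlier are set:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# The exporter reads the OTLP endpoint and headers from the environment,
# so no arguments are needed here.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)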
For details on setting up OpenTelemetry tracing with LangSmith, see Trace with OpenTelemetry.
Step 3. Set up key span attributes
Add the required span attributes to each application run. These attributes link each trace to the experiment and the specific dataset example.
The following attributes are relevant for experiment evaluation:
| Attribute | Purpose |
|---|---|
| langsmith.trace.session_id | Routes the trace to your experiment session |
| langsmith.reference_example_id | Links the trace to a specific dataset example |
| langsmith.span.kind | Sets the span type (e.g., “llm”, “chain”, “tool”) |
| inputs | Records the input to your application |
| outputs | Records the output from your application |
For a complete list of supported OpenTelemetry attributes, see Trace with OpenTelemetry.
from opentelemetry import trace
def evaluate_with_opentelemetry(agent, example_id: str, example_input: str, experiment_id: str):
    tracer = trace.get_tracer(__name__)

    # Wrapper span to add experiment metadata
    with tracer.start_as_current_span("experiment_evaluation") as span:
        # Route trace to the experiment
        span.set_attribute("langsmith.trace.session_id", experiment_id)
        # Link trace to the specific dataset example
        span.set_attribute("langsmith.reference_example_id", example_id)
        # Record input
        span.set_attribute("inputs", example_input)

        # Run the application
        response = agent(example_input)

        # Record output
        output_text = getattr(response, "output", str(response))
        span.set_attribute("outputs", output_text)

    return output_text
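The wrapper above doesn’t set langsmith.span.kind; if you create additional child spans for model or tool calls, you can set it so LangSmith renders them with the right type. Here is a minimal, illustrative sketch (the span name is hypothetical):
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Illustrative child span for a model call.
with tracer.start_as_current_span("model_call") as llm_span:
    # langsmith.span.kind tells LangSmith how to render the span (llm, chain, tool, ...).
    llm_span.set_attribute("langsmith.span.kind", "llm")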
Step 4. Run evaluation by iterating through dataset examples
Iterate through the dataset examples and run your application on each one using the helper from Step 3. Each run creates a trace in LangSmith that is linked to its dataset example.
# Iterate through dataset examples
for example in client.list_examples(dataset_id=dataset_id):
    # Extract input from the example inputs dictionary.
    # Adjust the key based on your dataset structure
    # (e.g., "input", "question", etc.)
    example_input = example.inputs.get("input")

    evaluate_with_opentelemetry(
        agent=agent,
        example_id=str(example.id),
        example_input=str(example_input),
        experiment_id=experiment_id,
    )
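Spans exported through a batch processor are delivered asynchronously, so a short-lived script can exit before everything has been sent. Here is a minimal sketch that flushes pending spans at the end of the run, assuming the globally registered tracer provider is the SDK’s TracerProvider:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

# Flush any spans still buffered by the batch span processor before exiting.
provider = trace.get_tracer_provider()
if isinstance(provider, TracerProvider):
    provider.force_flush()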
After running the evaluation, navigate to your experiment in the LangSmith UI to analyze the results. You can see:
- Individual trace details for each example
- Evaluator scores and feedback
- Comparisons between different experiment runs