Recommended Reading
Before diving into this content, it might be helpful to read the following:
- Running online evaluations
- LLM-as-a-judge: Use an LLM to evaluate traces as a scalable substitute for human-like judgment (e.g., toxicity, hallucinations, correctness). Supports two different levels of granularity:
- Run level: Evaluate a single run.
- Thread level: Evaluate all traces in a thread.
- Custom Code: Write an evaluator in Python directly in LangSmith. Often used for validating structure or statistical properties of your data.
View online evaluators
Head to the Tracing Projects tab and select a tracing project. To view existing online evaluators for that project, click on the Evaluators tab.
Configure online evaluators
1. Navigate to online evaluators
Head to the Tracing Projects tab and select a tracing project. Click + New in the top right corner of the tracing project page, then click New Evaluator. Select the evaluator you want to configure.
2. Name your evaluator
3. Create a filter
For example, you may want to apply specific evaluators based on:
- Runs where a user left feedback indicating the response was unsatisfactory.
- Runs that invoke a specific tool call. See filtering for tool calls for more information.
- Runs that match a particular piece of metadata (e.g., if you log traces with a `plan_type` and only want to run evaluations on traces from your enterprise customers). See adding metadata to your traces for more information, and see the sketch below for how this might look in code.
It’s often helpful to inspect runs as you’re creating a filter for your evaluator. With the evaluator configuration panel open, you can inspect runs and apply filters to them. Any filters you apply to the runs table will automatically be reflected in filters on your evaluator.
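For instance, to filter on a `plan_type` metadata field like the example above, you need to attach that metadata when you log the trace. Here is a minimal sketch assuming the LangSmith Python SDK's `@traceable` decorator; the function name and metadata values are illustrative.

```python
# Minimal sketch: attach a plan_type metadata field to traces so an online
# evaluator can filter on it (assumes the LangSmith Python SDK).
from langsmith import traceable

@traceable(metadata={"plan_type": "enterprise"})  # static metadata on every call
def answer_question(question: str) -> str:
    # ... call your model here; canned response for illustration ...
    return "You can find that in your account settings."

# Metadata can also be supplied per call, e.g. based on the current user's plan:
answer_question(
    "How do I rotate my API key?",
    langsmith_extra={"metadata": {"plan_type": "enterprise"}},
)
```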
4. (Optional) Configure a sampling rate
Configure a sampling rate to control the percentage of filtered runs that trigger the automation action. For example, to control costs, you may want to apply the evaluator to only 10% of traces. To do this, set the sampling rate to 0.1.
5. (Optional) Apply rule to past runs
Apply the rule to past runs by toggling Apply to past runs and entering a “Backfill from” date. This is only possible upon rule creation. Note: the backfill is processed as a background job, so you will not see the results immediately. To track the progress of the backfill, view logs for your evaluator by heading to the Evaluators tab within a tracing project and clicking the Logs button for the evaluator you created. Online evaluator logs are similar to automation rule logs.
- Add an evaluator name
- Optionally filter runs that you would like to apply your evaluator on or configure a sampling rate.
- Select Apply Evaluator
6. Select evaluator type
- Configuring LLM-as-a-judge evaluators
- Configuring custom code evaluators
Configure an LLM-as-a-judge online evaluator
View this guide to configure an LLM-as-a-judge evaluator.
Configure a custom code evaluator
Select custom code evaluator.
Write your evaluation function
Custom code evaluator restrictions:
- Allowed libraries: You can import all standard library functions, as well as an approved set of public packages.
- Network access: You cannot access the internet from a custom code evaluator.

Your evaluation function takes in:
- A `Run` (reference). This represents the sampled run to evaluate.

And returns:
- Feedback(s) Dictionary: A dictionary whose keys are the types of feedback you want to return, and whose values are the scores for each feedback key. For example, `{"correctness": 1, "silliness": 0}` would create two types of feedback on the run: one saying it is correct, and the other saying it is not silly.
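To make this concrete, here is a minimal sketch of an evaluation function that follows the contract above: it receives the sampled run and returns a feedback dictionary. The function name (`perform_eval`) and the dictionary-style access to the run's outputs are illustrative assumptions; follow the scaffold the in-app editor provides.

```python
import json

def perform_eval(run):
    # Pull the text to validate out of the sampled run's outputs.
    outputs = run.get("outputs") or {}
    answer = str(outputs.get("output", ""))

    # Feedback key 1: does the output parse as JSON?
    try:
        json.loads(answer)
        valid_json = 1
    except json.JSONDecodeError:
        valid_json = 0

    # Feedback key 2: is the output reasonably concise?
    conciseness = 1 if len(answer) < 500 else 0

    # Each key becomes a feedback entry on the run; each value is its score.
    return {"valid_json": valid_json, "conciseness": conciseness}
```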
Test and save your evaluation function
Before saving, you can test your evaluator function on a recent run by clicking Test Code to make sure that your code executes properly. Once you click Save, your online evaluator will run over newly sampled runs (and over backfilled runs as well, if you chose the backfill option).
Video guide
If you prefer a video tutorial, check out the Online Evaluations video from the Introduction to LangSmith Course.
Configure multi-turn online evaluators
Multi-turn online evaluators allow you to evaluate entire conversations between a human and an agent, not just individual exchanges. They measure end-to-end interaction quality across all turns in a thread. You can use multi-turn evaluations to measure:
- Semantic Intent: What the user was trying to do.
- Semantic Outcome: What actually happened and whether the task succeeded.
- Trajectory: How the conversation unfolded, including trajectory of tool calls.
Prerequisites
- Your tracing project must be using threads.
- The top-level inputs and outputs of each trace in a thread must have a `messages` key that contains a list of messages. We support messages in LangChain, OpenAI Chat Completions, and Anthropic Messages formats. A sketch of this shape is shown after this list.
- If the top-level inputs and outputs of each trace only contain the latest message in the conversation, LangSmith will automatically combine messages across turns into a thread.
- If the top-level inputs and outputs of each trace contain the full conversation history, LangSmith will use that directly.
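To make the expected shape concrete, here is a sketch of what the top-level inputs and outputs of a single trace (one turn) might look like using the OpenAI Chat Completions message format; the surrounding content is illustrative, and only the `messages` key matters for multi-turn evaluation.

```python
# Illustrative top-level inputs and outputs for one trace in a thread.
# Both carry a "messages" key holding a list of OpenAI Chat Completions-style
# messages ({"role": ..., "content": ...} dicts).
inputs = {
    "messages": [
        {"role": "system", "content": "You are a helpful support agent."},
        {"role": "user", "content": "Where is my order?"},
    ]
}
outputs = {
    "messages": [
        {"role": "assistant", "content": "Your order shipped yesterday and should arrive Friday."}
    ]
}
```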
If your traces don’t follow the format above, thread-level evaluators won’t work. You’ll need to update how you trace to LangSmith to ensure each trace’s top-level inputs and outputs contain a list of `messages`. Please refer to the troubleshooting section for more information.
Configuration
- Navigate to the Tracing Projects tab and select a tracing project.
- Click + New in the top right corner of the tracing project page > New Evaluator > Evaluate a multi-turn thread.
- Name your evaluator.
- Apply filters or a sampling rate.
Use filters or sampling to control evaluator cost. For example, evaluate only threads under N turns or sample 10% of all threads.
- Configure an idle time.
The first time you configure a thread level evaluator, you’ll define the idle time — the amount of time after the last trace in a thread before it’s considered complete and ready for evaluation. This value should reflect the expected length of user interactions in your app. It applies across all evaluators in the project.
When first testing your evaluator, use a short idle time so you can see results quickly. Once validated, increase it to match the expected length of user interactions.
- Configure your model.
Select the provider and model you want to use for your evaluator. Threads tend to get long, so you should use a model with a large context window to avoid running into limits. For example, OpenAI’s GPT-4.1 mini and Gemini 2.5 Flash are good options, as both have 1M+ token context windows.
- Configure your LLM-as-a-judge prompt.
Define what you want to evaluate. This prompt will be used to evaluate the thread (an example prompt is sketched after these steps). You can also configure which parts of the `messages` list are passed to the evaluator to control the content it receives:
- All messages: Send the full message list.
- Human and AI pairs: Send only user and assistant messages (excluding system messages, tool calls, etc.).
- First human and last AI: Send only the first user message and the last assistant reply.
- Set up your feedback configuration.
Configure a name for the feedback key, the format of the feedback you want to collect, and optionally enable reasoning on the feedback.
We don’t recommend using the same feedback key for a thread-level evaluator and a run-level evaluator as it can be hard to distinguish between the two.
- Save your evaluator.
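As referenced in the prompt step above, here is an illustrative example of a multi-turn LLM-as-a-judge prompt; the wording and criteria are assumptions to adapt to your application:

```
You are evaluating a full conversation between a user and an AI assistant.
Based on the messages provided, judge whether the assistant accomplished what
the user was trying to do. Consider the user's intent, whether the task
ultimately succeeded, and whether the trajectory of the conversation (including
any tool calls) was reasonable. Return a score of 1 if the conversation was
successful and 0 otherwise.
```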
Limits
These are the current limits for multi-turn online evaluators (subject to change). Please reach out if you are running into any of these limits.
- Runs must be less than one week old: When a thread becomes idle, only runs within the past 7 days are eligible for evaluation.
- Maximum of 500 threads evaluated at once: If more than 500 threads are marked as idle within a five-minute period, we will automatically sample them down to 500.
- Maximum of 10 multi-turn online evaluators per workspace
Troubleshooting
Checking the status of your evaluator
You can check when your evaluator was last run by heading to the Evaluators tab within a tracing project and clicking the Logs button for the evaluator you created to view its run history.
Inspect the data sent to the evaluator
Inspect the data sent to the evaluator by heading to the Evaluators tab within a tracing project, clicking on the evaluator you created and clicking the Evaluator traces tab. In this tab, you can see the inputs passed into the LLM-as-a-judge evaluator. If your messages are not being passed in correctly, you will see blank values in the inputs. This can happen if your messages are not formatted in one of the expected formats.