Recommended Reading
Before diving into this content, it might be helpful to read the following:
- Running online evaluations
- LLM-as-a-judge: Use an LLM to evaluate traces as a scalable substitute for human-like judgment (e.g., toxicity, hallucinations, correctness). Supports two different levels of granularity:
- Run level: Evaluate a single run.
- Thread level: Evaluate all traces in a thread.
- Custom Code: Write an evaluator in Python directly in LangSmith. Often used for validating structure or statistical properties of your data.
View online evaluators
Head to the Tracing Projects tab and select a tracing project. To view existing online evaluators for that project, click on the Evaluators tab.
Configure online evaluators
1. Navigate to online evaluators
Head to the Tracing Projects tab and select a tracing project. Click + New in the top right corner of the tracing project page, then click New Evaluator. Select the evaluator you want to configure.
2. Name your evaluator
3. Create a filter
For example, you may want to apply specific evaluators based on:
- Runs where a user left feedback indicating the response was unsatisfactory.
- Runs that invoke a specific tool call. See filtering for tool calls for more information.
- Runs that match a particular piece of metadata (e.g., if you log traces with a `plan_type` and only want to run evaluations on traces from your enterprise customers). See adding metadata to your traces for more information, and see the sketch below for how this might look in code.
It’s often helpful to inspect runs as you’re creating a filter for your evaluator. With the evaluator configuration panel open, you can inspect runs and apply filters to them. Any filters you apply to the runs table will automatically be reflected in filters on your evaluator.
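For instance, to filter on a `plan_type` metadata field like the example above, you need to attach that metadata when you log the trace. Here is a minimal sketch assuming the LangSmith Python SDK's `@traceable` decorator; the function name and metadata values are illustrative.

```python
# Minimal sketch: attach a plan_type metadata field to traces so an online
# evaluator can filter on it (assumes the LangSmith Python SDK).
from langsmith import traceable

@traceable(metadata={"plan_type": "enterprise"})  # static metadata on every call
def answer_question(question: str) -> str:
    # ... call your model here; canned response for illustration ...
    return "You can find that in your account settings."

# Metadata can also be supplied per call, e.g. based on the current user's plan:
answer_question(
    "How do I rotate my API key?",
    langsmith_extra={"metadata": {"plan_type": "enterprise"}},
)
```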
4. (Optional) Configure a sampling rate
Configure a sampling rate to control the percentage of filtered runs that trigger the automation action. For example, to control costs, you may want to apply the evaluator to only 10% of traces. To do this, set the sampling rate to 0.1.
5. (Optional) Apply rule to past runs
Apply the rule to past runs by toggling Apply to past runs and entering a “Backfill from” date. This is only possible upon rule creation. Note: the backfill is processed as a background job, so you will not see the results immediately. To track the progress of the backfill, view logs for your evaluator by heading to the Evaluators tab within a tracing project and clicking the Logs button for the evaluator you created. Online evaluator logs are similar to automation rule logs.
- Add an evaluator name
- Optionally filter runs that you would like to apply your evaluator on or configure a sampling rate.
- Select Apply Evaluator
6. Select evaluator type
- Configuring LLM-as-a-judge evaluators
- Configuring custom code evaluators
Configure an LLM-as-a-judge online evaluator
View this guide to configure an LLM-as-a-judge evaluator.
Configure a custom code evaluator
Select custom code evaluator.
Write your evaluation function
Custom code evaluator restrictions:
- Allowed libraries: You can import all standard library functions, as well as an approved set of public packages.
- Network access: You cannot access the internet from a custom code evaluator.

Your evaluation function takes in:
- A `Run` (reference). This represents the sampled run to evaluate.

And returns:
- Feedback(s) Dictionary: A dictionary whose keys are the types of feedback you want to return, and whose values are the scores for each feedback key. For example, `{"correctness": 1, "silliness": 0}` would create two types of feedback on the run: one saying it is correct, and the other saying it is not silly.
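To make this concrete, here is a minimal sketch of an evaluation function that follows the contract above: it receives the sampled run and returns a feedback dictionary. The function name (`perform_eval`) and the dictionary-style access to the run's outputs are illustrative assumptions; follow the scaffold the in-app editor provides.

```python
import json

def perform_eval(run):
    # Pull the text to validate out of the sampled run's outputs.
    outputs = run.get("outputs") or {}
    answer = str(outputs.get("output", ""))

    # Feedback key 1: does the output parse as JSON?
    try:
        json.loads(answer)
        valid_json = 1
    except json.JSONDecodeError:
        valid_json = 0

    # Feedback key 2: is the output reasonably concise?
    conciseness = 1 if len(answer) < 500 else 0

    # Each key becomes a feedback entry on the run; each value is its score.
    return {"valid_json": valid_json, "conciseness": conciseness}
```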
Test and save your evaluation function
Before saving, you can test your evaluator function on a recent run by clicking Test Code to make sure that your code executes properly. Once you click Save, your online evaluator will run over newly sampled runs (and over backfilled runs as well, if you chose the backfill option).
Video guide
If you prefer a video tutorial, check out the Online Evaluations video from the Introduction to LangSmith Course.
Configure multi-turn online evaluators
Multi-turn online evaluators allow you to evaluate entire conversations between a human and an agent, not just individual exchanges. They measure end-to-end interaction quality across all turns in a thread. You can use multi-turn evaluations to measure:
- Semantic Intent: What the user was trying to do.
- Semantic Outcome: What actually happened and whether the task succeeded.
- Trajectory: How the conversation unfolded, including trajectory of tool calls.
Prerequisites
- Your tracing project must be using threads.
- The top-level inputs and outputs of each trace in a thread must have a `messages` key that contains a list of messages. We support messages in LangChain, OpenAI Chat Completions, and Anthropic Messages formats. A sketch of this shape is shown after this list.
- If the top-level inputs and outputs of each trace only contain the latest message in the conversation, LangSmith will automatically combine messages across turns into a thread.
- If the top-level inputs and outputs of each trace contain the full conversation history, LangSmith will use that directly.
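To make the expected shape concrete, here is a sketch of what the top-level inputs and outputs of a single trace (one turn) might look like using the OpenAI Chat Completions message format; the surrounding content is illustrative, and only the `messages` key matters for multi-turn evaluation.

```python
# Illustrative top-level inputs and outputs for one trace in a thread.
# Both carry a "messages" key holding a list of OpenAI Chat Completions-style
# messages ({"role": ..., "content": ...} dicts).
inputs = {
    "messages": [
        {"role": "system", "content": "You are a helpful support agent."},
        {"role": "user", "content": "Where is my order?"},
    ]
}
outputs = {
    "messages": [
        {"role": "assistant", "content": "Your order shipped yesterday and should arrive Friday."}
    ]
}
```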
If your traces don’t follow the format above, thread-level evaluators won’t work. You’ll need to update how you trace to LangSmith to ensure each trace’s top-level inputs and outputs contain a list of `messages`. Please refer to the troubleshooting section for more information.
Configuration
- Navigate to the Tracing Projects tab and select a tracing project.
- Click + New in the top right corner of the tracing project page > New Evaluator > Evaluate a multi-turn thread.
- Name your evaluator.
- Apply filters or a sampling rate.
Use filters or sampling to control evaluator cost. For example, evaluate only threads under N turns or sample 10% of all threads.
- Configure an idle time.
The first time you configure a thread level evaluator, you’ll define the idle time — the amount of time after the last trace in a thread before it’s considered complete and ready for evaluation. This value should reflect the expected length of user interactions in your app. It applies across all evaluators in the project.
When first testing your evaluator, use a short idle time so you can see results quickly. Once validated, increase it to match the expected length of user interactions.
- Configure your model.
Select the provider and model you want to use for your evaluator. Threads tend to get long, so you should use a model with a large context window to avoid running into limits. For example, OpenAI’s GPT-4.1 mini and Gemini 2.5 Flash are good options, as both have 1M+ token context windows.
- Configure your LLM-as-a-judge prompt.
Define what you want to evaluate. This prompt will be used to evaluate the thread (an example prompt is sketched after these steps). You can also configure which parts of the `messages` list are passed to the evaluator to control the content it receives:
- All messages: Send the full message list.
- Human and AI pairs: Send only user and assistant messages (excluding system messages, tool calls, etc.).
- First human and last AI: Send only the first user message and the last assistant reply.
- Set up your feedback configuration.
Configure a name for the feedback key, the format of the feedback you want to collect, and optionally enable reasoning on the feedback.
We don’t recommend using the same feedback key for a thread-level evaluator and a run-level evaluator as it can be hard to distinguish between the two.
- Save your evaluator.
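As referenced in the prompt step above, here is an illustrative example of a multi-turn LLM-as-a-judge prompt; the wording and criteria are assumptions to adapt to your application:

```
You are evaluating a full conversation between a user and an AI assistant.
Based on the messages provided, judge whether the assistant accomplished what
the user was trying to do. Consider the user's intent, whether the task
ultimately succeeded, and whether the trajectory of the conversation (including
any tool calls) was reasonable. Return a score of 1 if the conversation was
successful and 0 otherwise.
```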
Limits
These are the current limits for multi-turn online evaluators (subject to change). Please reach out if you are running into any of these limits.
- Runs must be less than one week old: When a thread becomes idle, only runs within the past 7 days are eligible for evaluation.
- Maximum of 500 threads evaluated at once: If more than 500 threads are marked as idle within a five-minute period, we will automatically sample them down to 500.
- Maximum of 10 multi-turn online evaluators per workspace
Troubleshooting
Checking the status of your evaluator
You can check when your evaluator was last run by heading to the Evaluators tab within a tracing project and clicking the Logs button for the evaluator you created to view its run history.
Inspect the data sent to the evaluator
Inspect the data sent to the evaluator by heading to the Evaluators tab within a tracing project, clicking on the evaluator you created and clicking the Evaluator traces tab. In this tab, you can see the inputs passed into the LLM-as-a-judge evaluator. If your messages are not being passed in correctly, you will see blank values in the inputs. This can happen if your messages are not formatted in one of the expected formats.