- Dataset: A set of test inputs (and optionally, expected outputs).
- Target function: The part of your application you want to test—this might be a single LLM call with a new prompt, one module, or your entire workflow.
- Evaluators: Functions that score your target function’s outputs.
Prerequisites
Before you begin, make sure you have:
- A LangSmith account: Sign up or log in at smith.langchain.com.
- A LangSmith API key: Follow the Create an API key guide.
- An OpenAI API key: Generate this from the OpenAI dashboard.
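If you later run evaluations from code instead of the UI, these same keys are typically supplied as environment variables. A minimal sketch, assuming recent LangSmith SDK conventions (in practice, set these in your shell or a `.env` file rather than in source code):

```python
import os

# Assumed environment variable names; set these before running an evaluation script.
os.environ["LANGSMITH_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGSMITH_TRACING"] = "true"  # enable tracing to LangSmith
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
```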
The steps below walk through this workflow in the LangSmith UI.
1. Set workspace secrets
In the LangSmith UI, ensure that your OpenAI API key is set as a workspace secret.
- Navigate to Settings, then open the Secrets tab.
- Select Add secret, enter `OPENAI_API_KEY` as the Key and your API key as the Value.
- Select Save secret.
When adding workspace secrets in the LangSmith UI, make sure the secret keys match the environment variable names expected by your model provider.
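The name matters because model providers' clients read the key from the environment at runtime. For example, the official OpenAI Python client falls back to `OPENAI_API_KEY` when no key is passed explicitly:

```python
from openai import OpenAI

# With no api_key argument, the client reads the OPENAI_API_KEY environment
# variable, matching the workspace secret name used above.
client = OpenAI()
```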
2. Create a prompt
LangSmith’s Prompt Playground makes it possible to run evaluations over different prompts, try new models, or test different model configurations.
- In the LangSmith UI, navigate to the Playground under Prompt Engineering.
- Under the Prompts panel, modify the system prompt to the instructions you want to evaluate (for example, directing the model to answer the question accurately). Leave the Human message as is: `{question}`.
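Outside the Playground, this prompt amounts to a system message plus a human message filled in with the dataset's `question` field. A rough sketch of an equivalent target function using the OpenAI Python client (the system prompt text and model name below are placeholders, not prescribed by this guide):

```python
from openai import OpenAI

openai_client = OpenAI()

def target(inputs: dict) -> dict:
    """Answer a dataset question using the prompt configured in the Playground."""
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use the model selected in the Playground
        messages=[
            # Placeholder system prompt, standing in for the one you wrote above.
            {"role": "system", "content": "Answer the question accurately and concisely."},
            # The Human message: the {question} template filled from the dataset input.
            {"role": "user", "content": inputs["question"]},
        ],
    )
    return {"answer": response.choices[0].message.content}
```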
3. Create a dataset
- Click Set up Evaluation, which will open a New Experiment table at the bottom of the page.
- In the Select or create a new dataset dropdown, click the + New button to create a new dataset.
- Add the following examples to the dataset:

| Inputs | Reference Outputs |
| --- | --- |
| question: Which country is Mount Kilimanjaro located in? | output: Mount Kilimanjaro is located in Tanzania. |
| question: What is Earth’s lowest point? | output: Earth’s lowest point is The Dead Sea. |

- Click Save and enter a name to save your newly created dataset.
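The same dataset can also be created programmatically. A sketch using recent versions of the LangSmith Python SDK (the dataset name is arbitrary):

```python
from langsmith import Client

client = Client()  # reads LANGSMITH_API_KEY from the environment

dataset = client.create_dataset(dataset_name="Quickstart QA dataset")
client.create_examples(
    dataset_id=dataset.id,
    examples=[
        {
            "inputs": {"question": "Which country is Mount Kilimanjaro located in?"},
            "outputs": {"output": "Mount Kilimanjaro is located in Tanzania."},
        },
        {
            "inputs": {"question": "What is Earth's lowest point?"},
            "outputs": {"output": "Earth's lowest point is The Dead Sea."},
        },
    ],
)
```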
4. Add an evaluator
- Click + Evaluator and select Correctness from the Pre-built Evaluator options.
- In the Correctness panel, click Save.
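The pre-built Correctness evaluator uses an LLM as a judge to compare the output against the reference. If you move the evaluation into code, a custom evaluator is just a function that receives the inputs, outputs, and reference outputs and returns a score. A rough, illustrative stand-in (the judge prompt and model are placeholders, not the exact ones the pre-built evaluator uses):

```python
from openai import OpenAI

judge_client = OpenAI()

def correctness(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Illustrative LLM-as-judge: asks a model whether the answer matches the reference."""
    verdict = judge_client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[
            {
                "role": "system",
                "content": "Reply with exactly one word, CORRECT or INCORRECT: "
                           "does the answer agree with the reference answer?",
            },
            {
                "role": "user",
                "content": f"Question: {inputs['question']}\n"
                           f"Answer: {outputs['answer']}\n"
                           f"Reference: {reference_outputs['output']}",
            },
        ],
    )
    grade = verdict.choices[0].message.content.strip().upper()
    return {"key": "correctness", "score": grade == "CORRECT"}
```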
5. Run your evaluation
- Select Start in the top right to run your evaluation. This creates an experiment with a preview in the New Experiment table; click the experiment name to view the full results.
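For reference, the equivalent experiment can be run from code with the SDK's `evaluate` entry point (available in recent versions of the langsmith package), tying together the target function, dataset, and evaluator from the sketches above:

```python
from langsmith import evaluate

results = evaluate(
    target,                               # the function under test
    data="Quickstart QA dataset",         # dataset name created earlier
    evaluators=[correctness],             # evaluator functions to apply
    experiment_prefix="quickstart-eval",  # optional prefix for the experiment name
)
```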

Next steps
- For more details on evaluations, refer to the Evaluation documentation.
- Learn how to create and manage datasets in the UI.
- Learn how to run an evaluation from the prompt playground.

