- LangSmith evaluations: Record every Harbor job to LangSmith as an experiment with --plugin langsmith.
- Deep Agents: Run a LangGraph or Deep Agents application as the Harbor agent with --agent langgraph.
- Sandboxes: Run each Harbor trial on a LangSmith sandbox with --env langsmith.
For the full set of options, run harbor run --help or see the Harbor documentation.
Prerequisites
- A LangSmith account and an API key.
- Python with pip.
- A provider API key for the model your agent calls, such as ANTHROPIC_API_KEY.
Install
Install Harbor with the langsmith extra. The extra includes the harbor-langsmith package used by the LangSmith plugin, environment, and agent.
Authenticate
Harbor authenticates with your LangSmith credentials. Set an API key in the LANGSMITH_API_KEY environment variable.
Quickstart
Record a Harbor job to LangSmith as an experiment. Replace <agent> with a Harbor agent, and <provider:model> with a model in provider:model format that an installed langchain-* provider can resolve, for example anthropic:claude-opus-4-8. Run harbor run --help to list the available agents, or see Deep Agents for a complete langgraph run.
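A minimal sketch of such an invocation, assembled in Python so each piece is explicit. It uses only flags named on this page (-d, --agent, --model, --plugin). The helper name quickstart_command is hypothetical, and the exact flag combination is an assumption to confirm against harbor run --help.

```python
import shlex


def quickstart_command(dataset: str, agent: str, model: str) -> list[str]:
    """Build the argv for a Harbor run that records to LangSmith (assumed flag set)."""
    return [
        "harbor", "run",
        "-d", dataset,            # benchmark dataset in name@version form
        "--agent", agent,         # any Harbor agent
        "--model", model,         # provider:model
        "--plugin", "langsmith",  # record the job as a LangSmith experiment
    ]


# "<agent>" stays a placeholder; substitute a real Harbor agent name.
argv = quickstart_command("terminal-bench@2.0", "<agent>", "anthropic:claude-opus-4-8")
print(shlex.join(argv))
```

Printing the command with shlex.join, rather than running it, makes it easy to review the quoting before pasting it into a shell.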
Open Datasets & Experiments, select the dataset Harbor synced, such as terminal-bench@2.0, then open the Experiments tab to view the run.
LangSmith evaluations
The LangSmith plugin records every Harbor job to LangSmith, so you can view and compare results under Datasets & Experiments. The plugin works with any Harbor agent, not only Deep Agents. Enable it with --plugin langsmith. The Quickstart shows the basic invocation, and this section covers what the plugin records and how to configure it.
Choose an agent that traces to LangSmith to capture full agent traces alongside the experiment. If the agent does not trace to LangSmith, the plugin still creates the dataset and the experiment with results and feedback, without the agent trace.
Pass the full import path instead of the short plugin name when you need to disambiguate it.
The plugin authenticates with LANGSMITH_API_KEY.
See what the plugin records
As the job runs, the plugin writes to LangSmith over the API:
- Dataset: Syncs a reference dataset from the job. The default name comes from the dataset or task, for example terminal-bench@2.0. Each task becomes an example whose inputs are the task name, the instruction, and the task ID.
- Experiment: Creates one experiment per job, named <name>-<job-id-prefix>, linked to the reference dataset.
- Runs: Creates a root run per trial with inputs for the task name, instruction, agent, and model, plus child runs for the environment, agent, and verification phases.
- Feedback: Attaches one feedback score per verifier reward key, such as reward, and a harbor_error feedback when a trial raises an exception.
- Outputs: Records token counts under tokens (input, cache, output) and the run cost under cost_usd for each trial run.
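To make the feedback and output records concrete, here is an illustrative sketch of the data a single trial might produce. The key names reward, harbor_error, tokens, and cost_usd come from the list above. The dictionary shapes, the score and comment fields, and both function names are assumptions and do not reflect the plugin's actual internals.

```python
from typing import Optional


def feedback_for_trial(rewards: dict[str, float], error: Optional[str] = None) -> list[dict]:
    """One feedback entry per verifier reward key, plus harbor_error on failure."""
    feedback = [{"key": key, "score": score} for key, score in rewards.items()]
    if error is not None:
        # Assumed shape: the exception text is carried as a comment.
        feedback.append({"key": "harbor_error", "comment": error})
    return feedback


def outputs_for_trial(input_tokens: int, cache_tokens: int, output_tokens: int,
                      cost_usd: float) -> dict:
    """Token counts under 'tokens' and the run cost under 'cost_usd'."""
    return {
        "tokens": {"input": input_tokens, "cache": cache_tokens, "output": output_tokens},
        "cost_usd": cost_usd,
    }


print(feedback_for_trial({"reward": 1.0}))
print(feedback_for_trial({"reward": 0.0}, error="AgentTimeoutError"))
print(outputs_for_trial(1200, 300, 450, 0.02))
```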
View results in LangSmith
Open Datasets & Experiments in LangSmith, select the dataset the plugin synced, such as terminal-bench@2.0, then open the Experiments tab. Each Harbor job appears as an experiment, and you can compare experiments by the reward and harbor_error feedback, the token counts and cost recorded on each run, and latency.
Configure the plugin inputs
The plugin reads each input from a constructor keyword argument first, then falls back to an environment variable. Set the inputs with environment variables:
- HARBOR_LANGSMITH_DATASET: The dataset name. Defaults to a name derived from the job.
- HARBOR_LANGSMITH_EXPERIMENT: The experiment base name. Defaults to the job name.
- LANGSMITH_ENDPOINT: The LangSmith API endpoint. Defaults to https://api.smith.langchain.com.
- LANGSMITH_WORKSPACE_ID: The target workspace.
- HARBOR_LANGSMITH_SYNC_DATASET: Set to false to disable dataset and example syncing.
- HARBOR_LANGSMITH_FAIL_FAST: Set to true to raise on a LangSmith API error instead of continuing the job.
You can also pass the inputs as plugin kwargs with --pk on the command line, or under kwargs: in a job config file. The kwargs mirror the constructor options: dataset_name, experiment_name, endpoint, api_key, workspace_id, sync_dataset, and fail_fast.
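The kwarg-then-environment lookup order can be sketched as follows. The environment variable names and the default endpoint come from the list above. The function names and the rule for parsing true and false strings are assumptions for illustration, not the plugin's code.

```python
import os
from typing import Mapping, Optional


def resolve_input(kwarg_value: Optional[str], env_name: str,
                  env: Optional[Mapping[str, str]] = None,
                  default: Optional[str] = None) -> Optional[str]:
    """Return the kwarg if given, else the environment variable, else the default."""
    if kwarg_value is not None:
        return kwarg_value
    env = os.environ if env is None else env
    return env.get(env_name, default)


def resolve_flag(kwarg_value: Optional[bool], env_name: str, default: bool,
                 env: Optional[Mapping[str, str]] = None) -> bool:
    """Boolean variant. Assumption: only 'true' and 'false' (any case) are recognized."""
    if kwarg_value is not None:
        return kwarg_value
    env = os.environ if env is None else env
    raw = env.get(env_name, "").strip().lower()
    if raw == "true":
        return True
    if raw == "false":
        return False
    return default


env = {"HARBOR_LANGSMITH_DATASET": "from-env", "HARBOR_LANGSMITH_SYNC_DATASET": "false"}
print(resolve_input("from-kwarg", "HARBOR_LANGSMITH_DATASET", env))  # kwarg wins
print(resolve_input(None, "HARBOR_LANGSMITH_DATASET", env))          # env fallback
print(resolve_input(None, "LANGSMITH_ENDPOINT", env, "https://api.smith.langchain.com"))
print(resolve_flag(None, "HARBOR_LANGSMITH_SYNC_DATASET", True, env))
```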
Deep Agents
The langgraph agent runs a LangGraph application, such as a Deep Agent, as the Harbor agent. Select it with --agent langgraph. Harbor stages your project into the sandbox, installs its dependencies, and runs the graph inside the container for each trial.
Set your LangSmith and model credentials, then run Harbor. harbor run is an alias for harbor job start, which builds a job, spins up the environment, and runs the LangGraph agent.
Choose what to evaluate against
A task is one directory with a fixed layout: task.toml for configuration, instruction.md for the prompt, environment/ for the Dockerfile the sandbox is built from, and tests/ for the verifier that writes the reward. A dataset is many such task directories.
A task or dataset can be local or remote: point Harbor at your own folder of task directories, or pull one from Harbor’s registry.
Three inputs select the tasks a job runs against:
- -t org/name[@ref]: A single task from the registry. Remote tasks are fetched with a registry lookup, then cloned at the pinned commit into ~/.cache/harbor/tasks.
- -d name@version: A whole benchmark dataset, which is many tasks. Each task is resolved from the registry and cloned into the cache.
- -p <dir>: A local path to one task or a root folder of many tasks. Local paths are read in place, with no download and no cache copy.
Filter the selected tasks with -i and -x (glob include and exclude), and cap the count with -l.
A task directory has this layout:
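The layout can be sketched as a small Python check. The entry names (task.toml, instruction.md, environment/, tests/) come from the description above. The helper names and the choice to test only for presence are assumptions and are not part of Harbor.

```python
import tempfile
from pathlib import Path

REQUIRED_FILES = ("task.toml", "instruction.md")  # configuration and prompt
REQUIRED_DIRS = ("environment", "tests")          # sandbox definition and verifier


def missing_task_entries(task_dir) -> list[str]:
    """List the required entries that are absent from a task directory."""
    root = Path(task_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    missing += [f"{name}/" for name in REQUIRED_DIRS if not (root / name).is_dir()]
    return missing


def scaffold_task(task_dir) -> None:
    """Create an empty skeleton with the required files and directories."""
    root = Path(task_dir)
    for name in REQUIRED_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    for name in REQUIRED_FILES:
        (root / name).touch()


with tempfile.TemporaryDirectory() as tmp:
    print(missing_task_entries(tmp))  # everything missing in an empty directory
    scaffold_task(tmp)
    print(missing_task_entries(tmp))  # nothing missing after scaffolding
```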
Configure the agent
Pass agent kwargs with --ak. The agent accepts these options:
- --agent langgraph: Selects the LangGraph agent.
- --model <provider:model>: The model to run. There is no default, so this value is required. The agent resolves it with init_chat_model, so it must be resolvable by an installed langchain-* provider in provider:model format, for example anthropic:claude-opus-4-8. A provider/model value is normalized to provider:model. The model comes from configurable['model'] or the HARBOR_MODEL environment variable, and an unresolvable or missing value raises a ValueError.
- --ak project_path=<dir>: The local directory that contains langgraph.json.
- --ak graph=<name>: Which graph in langgraph.json to run.
- --ak config=<file>: The config filename inside project_path that declares the graphs. Defaults to langgraph.json.
- --ak configurable='{...}': LangGraph per-run config passed to config["configurable"] and read by the graph at invoke time. Common keys are model, model_kwargs, and cwd.
- --ak model_kwargs='{...}': Shorthand for the nested model_kwargs key in configurable, for example {"temperature": 0, "max_tokens": 8000}.
- --ak dependency_overrides='[...]': Pip packages for the agent virtual environment. This list replaces the dependencies declared in langgraph.json, which lets you pin or swap versions without editing the project, for example '["deepagents==0.1.5"]'.
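The model lookup and normalization described for --model can be sketched like this. It follows the stated rules: read configurable['model'] or HARBOR_MODEL, turn provider/model into provider:model, and raise ValueError otherwise. The function name and the precedence of configurable over the environment variable are assumptions. The real agent goes on to call init_chat_model, which this sketch omits.

```python
import os
from typing import Mapping, Optional


def resolve_model(configurable: Mapping[str, object],
                  env: Optional[Mapping[str, str]] = None) -> str:
    """Return a provider:model string or raise ValueError."""
    env = os.environ if env is None else env
    raw = configurable.get("model") or env.get("HARBOR_MODEL")
    if not raw:
        raise ValueError("No model given: pass --model or set HARBOR_MODEL.")
    model = str(raw).strip()
    # Normalize provider/model to provider:model when no colon is present.
    if ":" not in model and "/" in model:
        provider, name = model.split("/", 1)
        model = f"{provider}:{name}"
    provider, sep, name = model.partition(":")
    if not (provider and sep and name):
        raise ValueError(f"Model {model!r} is not in provider:model format.")
    return model


print(resolve_model({"model": "anthropic/claude-opus-4-8"}, env={}))
print(resolve_model({}, env={"HARBOR_MODEL": "anthropic:claude-opus-4-8"}))
```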
Point langgraph.json at the agent and dependencies
The agent loads graphs from the langgraph.json file in project_path. The file declares the graph entry points and the pip dependencies Harbor installs in the sandbox virtual environment.
The example project declares two graphs that you select with --ak graph. Both build a Deep Agent with create_deep_agent and differ only in their inputs:
- deep_agent resolves to make_graph, a Deep Agent created with only the model.
- research_agent resolves to make_research_graph, the same Deep Agent with a research system prompt.
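As a hedged illustration, the following sketch builds a langgraph.json with those two graphs and splits a graph entry into its file and attribute parts. The dependencies and graphs keys follow the standard langgraph.json format. The module path ./agent.py and the dependency list are hypothetical, and graph_entry_point is an illustrative helper, not a Harbor or LangGraph function.

```python
import json

# Hypothetical project config: graph names and factory names come from the list
# above; the module path and the dependency list are assumptions.
langgraph_config = {
    "dependencies": [".", "deepagents"],
    "graphs": {
        "deep_agent": "./agent.py:make_graph",
        "research_agent": "./agent.py:make_research_graph",
    },
}


def graph_entry_point(config: dict, graph: str) -> tuple[str, str]:
    """Split a 'path:attribute' graph spec into (path, attribute)."""
    try:
        spec = config["graphs"][graph]
    except KeyError:
        raise ValueError(f"Graph {graph!r} is not declared in langgraph.json.") from None
    path, _, attribute = spec.rpartition(":")
    return path, attribute


print(json.dumps(langgraph_config, indent=2))
print(graph_entry_point(langgraph_config, "research_agent"))
```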
Each factory passes --model (read from configurable.model) to create_deep_agent, which resolves it with init_chat_model().
Reading the model from configurable.model keeps the graph model-agnostic, but you can also hardcode the model in the graph when it should always run the same one. For a fixed model, point langgraph.json at a compiled graph instead of a factory.
Run the agent inside the sandbox
Harbor runs the entire agent inside the trial container.
Single-trial lifecycle
- Parse and prepare: harbor run parses the flags into a job config. The job factory resolves and caches the tasks, validates the environment resource limits, and resolves the metrics before any trial runs. Caching applies to remote tasks only, so a -p local task is read in place.
- Fan out: Harbor builds the trial list from n_attempts × tasks × agents, then runs trials concurrently up to the -n limit, with -r retries. Parallelism is per trial, so different tasks, agents, and attempts run together, each in its own sandbox.
- Create the trial: The trial loads the cached task, builds the LangGraph agent from project_path, graph, and model, and constructs the environment without starting it.
- Start the environment: The environment starts and brings up the container. For the Docker environment, this builds or reuses the image and runs the container.
- Install the agent: Harbor creates a virtual environment in the container, uploads project_path, and pip installs the langgraph.json dependencies inside the container.
- Run and verify: Harbor runs the graph inside the container through the LangGraph runner, then runs tests/test.sh, which writes the reward to /logs/verifier/reward.txt.
- Finalize: Harbor stops and deletes the container and writes the trial result. The job aggregates all trial results into one job result.
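The fan-out step can be sketched as a Cartesian product followed by a bounded worker pool. This is a simplified model, not Harbor's scheduler: the thread pool stands in for whatever concurrency mechanism Harbor uses, the retry semantics (a retry is an extra attempt after an exception) are an assumption, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from typing import Callable


def build_trials(tasks: list[str], agents: list[str], n_attempts: int) -> list[dict]:
    """One trial per (attempt, task, agent) combination."""
    return [
        {"task": task, "agent": agent, "attempt": attempt}
        for attempt, task, agent in product(range(1, n_attempts + 1), tasks, agents)
    ]


def run_with_retries(trial: dict, run_trial: Callable[[dict], object], retries: int):
    """Run one trial, retrying up to `retries` extra times on an exception."""
    for attempt in range(retries + 1):
        try:
            return run_trial(trial)
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error


def run_job(trials: list[dict], run_trial: Callable[[dict], object],
            n_concurrent: int, retries: int) -> list:
    """Run trials with at most n_concurrent in flight, preserving input order."""
    with ThreadPoolExecutor(max_workers=n_concurrent) as pool:
        return list(pool.map(lambda t: run_with_retries(t, run_trial, retries), trials))


trials = build_trials(["task-a", "task-b"], ["langgraph"], n_attempts=3)
results = run_job(trials, lambda t: f"{t['task']}#{t['attempt']}", n_concurrent=2, retries=1)
print(len(trials), results)
```

With two tasks, one agent, and three attempts, the product yields six trials, which matches the n_attempts × tasks × agents count described above.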
Sandboxes
The langsmith Harbor environment runs each trial on a LangSmith sandbox. Select it with --env langsmith to execute Harbor jobs on LangSmith infrastructure, alongside other sandbox providers. Each trial gets its own sandbox, which Harbor deletes when the trial finishes.
Run an evaluation
Run a Harbor job and select the LangSmith environment with --env langsmith.
Configure the sandbox environment
The LangSmith environment boots each sandbox from a filesystem snapshot. Provide one of the following in your Harbor task:
- Prebuilt image: Set [environment].docker_image in task.toml. Harbor reuses or creates a snapshot from that image.
- Existing snapshot: Pass environment.kwargs.snapshot_name to boot from a snapshot you already created.
- Dockerfile: Include an environment/Dockerfile. Harbor builds a snapshot from it with the build-from-Dockerfile flow, using the task environment/ directory as the build context.
Pass environment kwargs with --ek:
- idle_ttl_seconds: Stops an idle sandbox after this many seconds. Set 0 to disable the idle timeout.
- delete_after_stop_seconds: Deletes a stopped sandbox after this many seconds.
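The three snapshot sources can be modeled as a simple selection function. The source names (docker_image, snapshot_name, environment/Dockerfile) come from the list above. The precedence order used here (existing snapshot, then prebuilt image, then Dockerfile) is purely an assumption for illustration, since this page does not state which source wins when several are present. The function name is hypothetical.

```python
from pathlib import Path
from typing import Optional


def snapshot_source(task_dir, docker_image: Optional[str] = None,
                    snapshot_name: Optional[str] = None) -> tuple[str, str]:
    """Pick where a sandbox snapshot would come from (assumed precedence)."""
    if snapshot_name:
        return ("snapshot", snapshot_name)  # boot from an existing snapshot
    if docker_image:
        return ("image", docker_image)      # reuse or create a snapshot from the image
    dockerfile = Path(task_dir) / "environment" / "Dockerfile"
    if dockerfile.is_file():
        return ("dockerfile", str(dockerfile))  # build with environment/ as context
    raise ValueError(
        "Provide docker_image, snapshot_name, or an environment/Dockerfile."
    )


print(snapshot_source(".", snapshot_name="my-snapshot"))
print(snapshot_source(".", docker_image="python:3.12-slim"))
```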
Troubleshooting
- The job fails to start with an authentication error: Confirm LANGSMITH_API_KEY is set, or that LANGSMITH_PROFILE points to a configured profile.
- The agent raises a ValueError for the model: Pass --model in provider:model format, and install the matching langchain-* provider package so init_chat_model() can resolve it.
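Both checks can be run before starting a job with a small preflight script like the sketch below. The environment variable names come from this page. The function name is hypothetical, and the mapping from a provider name to an importable langchain_<provider> module is an assumption based on the usual naming of langchain-* packages, so it may not hold for every provider.

```python
import importlib.util
import os
from typing import Mapping, Optional


def preflight(model: str, env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return a list of likely configuration problems (empty if none found)."""
    env = os.environ if env is None else env
    problems = []
    if not (env.get("LANGSMITH_API_KEY") or env.get("LANGSMITH_PROFILE")):
        problems.append("auth: set LANGSMITH_API_KEY or LANGSMITH_PROFILE")
    provider, sep, name = model.partition(":")
    if not (provider and sep and name):
        problems.append("model: pass --model in provider:model format")
    else:
        # Assumed convention: langchain-anthropic imports as langchain_anthropic.
        module = "langchain_" + provider.replace("-", "_")
        if importlib.util.find_spec(module) is None:
            problems.append(f"provider: install the langchain-* package for {provider!r}")
    return problems


for problem in preflight("anthropic:claude-opus-4-8"):
    print(problem)
```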
See also
- Run evaluations with Harbor
- Deep Agents documentation
- Datasets & Experiments
- Analyze an experiment
- Sandbox snapshots
- Harbor documentation

