The default configuration for LangSmith Agent Server is designed to handle substantial read and write load across a variety of workloads. By following the best practices outlined below, you can tune your Agent Server to perform optimally for your specific workload. This page describes scaling considerations for the Agent Server and provides examples to help configure your deployment. For example self-hosted configurations, refer to the Example self-hosted Agent Server configurations section.

Scaling for write load

Write load is primarily driven by the following factors:
  • Creation of new runs
  • Creation of new checkpoints during run execution
  • Writing to long-term memory
  • Creation of new threads
  • Creation of new assistants
  • Deletion of runs, checkpoints, threads, assistants and cron jobs
The following components are primarily responsible for handling write load:
  • API server: Handles initial request and persistence of data to the database.
  • Queue worker: Handles the execution of runs.
  • Redis: Handles the storage of ephemeral data about ongoing runs.
  • Postgres: Handles the storage of all data, including runs, threads, assistants, cron jobs, checkpoints, and long-term memory.

Best practices for scaling the write path

Change N_JOBS_PER_WORKER based on assistant characteristics

The default value of N_JOBS_PER_WORKER is 10. You can change this value to scale the maximum number of runs that can be executed at a time by a single queue worker based on the characteristics of your assistant. Some general guidelines for changing N_JOBS_PER_WORKER:
  • If your assistant is CPU-bound, the default value of 10 is likely sufficient. You might lower N_JOBS_PER_WORKER if you notice excessive CPU usage on queue workers or delays in run execution.
  • If your assistant is I/O-bound, increase N_JOBS_PER_WORKER to handle more concurrent runs per worker.
There is no upper limit to N_JOBS_PER_WORKER. However, queue workers are greedy when fetching new runs, which means they will try to pick up as many runs as they have available jobs and begin executing them immediately. Setting N_JOBS_PER_WORKER too high in environments with bursty traffic can lead to uneven worker utilization and increased run execution times.
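As a rough illustration (the run time and I/O-wait fraction below are assumed example values, not measured ones, and this heuristic is not an official sizing formula), you can estimate an upper bound for N_JOBS_PER_WORKER from how much of a run's wall-clock time is spent waiting on I/O:
# Illustrative heuristic only: estimate how many runs one CPU can interleave.
# Assumes each run takes ~1 s of wall-clock time and spends 90% of it waiting
# on I/O (e.g. LLM API calls); measure these numbers for your own assistant.
run_time_s = 1.0
io_wait_fraction = 0.9
cpu_time_per_run_s = run_time_s * (1 - io_wait_fraction)  # 0.1 s of CPU work per run

# With 1 CPU per queue worker, roughly this many runs can be in flight before
# the CPU becomes the bottleneck:
rough_upper_bound = run_time_s / cpu_time_per_run_s  # ~10
print(rough_upper_bound)
A mostly I/O-bound assistant (for example, 98% of run time spent waiting on model APIs) would support a proportionally higher value, while a CPU-heavy assistant would support a lower one.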

Avoid synchronous blocking operations

Avoid synchronous blocking operations in your code and prefer asynchronous operations. Long synchronous operations can block the main event loop, causing longer request and run execution times and potential timeouts. For example, consider an application that needs to sleep for 1 second. Instead of using synchronous code like this:
import time

def my_function():
    time.sleep(1)
Prefer asynchronous code like this:
import asyncio

async def my_function():
    await asyncio.sleep(1)
If an assistant requires synchronous blocking operations, set BG_JOB_ISOLATED_LOOPS to True to execute each run in a separate event loop.
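If only part of your assistant is blocking (for example, a library that exposes no async API) and you do not want to isolate the whole run, another option is to offload that call to a worker thread so the event loop stays responsive. A minimal sketch, where slow_blocking_lookup is a hypothetical placeholder for such a call:
import asyncio
import time

def slow_blocking_lookup(query: str) -> str:
    # Hypothetical stand-in for a library call that only has a blocking API.
    time.sleep(1)
    return f"result for {query}"

async def my_node(state: dict) -> dict:
    # Offload the blocking call to a worker thread so the event loop can keep
    # serving other runs on this queue worker.
    result = await asyncio.to_thread(slow_blocking_lookup, state["query"])
    return {"result": result}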

Minimize redundant checkpointing

Minimize redundant checkpointing by setting durability to the minimum level necessary to ensure your data is durable. The default durability mode is "async", meaning checkpoints are written asynchronously after each step. If an assistant needs to persist only the final state of the run, set durability to "exit". This can be set when creating the run:
from langgraph_sdk import get_client

client = get_client(url=<DEPLOYMENT_URL>)
thread = await client.threads.create()
run = await client.runs.create(
    thread_id=thread["thread_id"],
    assistant_id="agent",
    durability="exit"
)

Self-hosted

These settings are only required for self-hosted deployments. By default, cloud deployments already have these best practices enabled.
Enable the use of queue workers
By default, the API server manages the queue and does not use queue workers. You can enable the use of queue workers by setting the queue.enabled configuration to true.
queue:
  enabled: true
This will allow the API server to offload the queue management to the queue workers, significantly reducing the load on the API server and allowing it to focus on handling requests.
Support a number of jobs equal to expected throughput
The more runs you execute in parallel, the more jobs you will need to handle the load. There are two main parameters to scale the available jobs:
  • number_of_queue_workers: The number of queue workers provisioned.
  • N_JOBS_PER_WORKER: The number of runs that a single queue worker can execute at a time. Defaults to 10.
You can calculate the available jobs with the following equation:
available_jobs = number_of_queue_workers * N_JOBS_PER_WORKER
Throughput is then the number of runs that can be executed per second by the available jobs:
throughput_per_second = available_jobs / average_run_execution_time_seconds
Therefore, the minimum number of queue workers you should provision to support your expected steady state throughput is:
number_of_queue_workers = throughput_per_second * average_run_execution_time_seconds / N_JOBS_PER_WORKER
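For example, using the high-write figures from the table below (500 runs per second, an average run execution time of 1 second, and N_JOBS_PER_WORKER set to 50), the minimum worker count works out to 10:
# Worked example: sustain 500 new runs per second with an average run
# execution time of 1 second and N_JOBS_PER_WORKER set to 50.
throughput_per_second = 500
average_run_execution_time_seconds = 1.0
n_jobs_per_worker = 50

# Runs that must be executing concurrently at steady state:
available_jobs_needed = throughput_per_second * average_run_execution_time_seconds  # 500

# Minimum number of queue workers to provision:
number_of_queue_workers = available_jobs_needed / n_jobs_per_worker  # 10.0
print(number_of_queue_workers)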
Configure autoscaling for bursty workloads
Autoscaling is disabled by default, but should be configured for bursty workloads. Using the same calculations as the previous section, you can determine the maximum number of queue workers you should allow the autoscaler to scale to based on maximum expected throughput.

Scaling for read load

Read load is primarily driven by requests that retrieve existing data, such as fetching and searching threads, runs, assistants, and checkpoints, and polling the state of in-progress runs. The following components are primarily responsible for handling read load:
  • API server: Handles the request and direct retrieval of data from the database.
  • Postgres: Handles the storage of all data, including runs, threads, assistants, cron jobs, checkpoints, and long-term memory.
  • Redis: Handles the storage of ephemeral data about ongoing runs, including streaming messages from queue workers to API servers.

Best practices for scaling the read path

Use filtering to reduce the number of resources returned per request

Agent Server provides a search API for each resource type. These APIs implement pagination by default and offer many filtering options. Use filtering to reduce the number of resources returned per request and improve performance.
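For example, with the Python SDK you can page through threads and filter by metadata rather than listing everything in one request (the user_id metadata key below is a hypothetical example):
from langgraph_sdk import get_client

client = get_client(url=<DEPLOYMENT_URL>)

# Fetch only the threads for one user, 20 at a time, instead of
# retrieving every thread in a single request.
threads = await client.threads.search(
    metadata={"user_id": "user-123"},  # hypothetical metadata filter
    limit=20,
    offset=0,
)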

Set a TTL to automatically delete old data

Set a TTL on threads to automatically clean up old data. Runs and checkpoints are automatically deleted when the associated thread is deleted.
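If you also need to clean up data that predates your TTL configuration, a one-off sweep with the SDK is one option. A rough sketch, assuming thread payloads expose an ISO-8601 updated_at timestamp and using an arbitrary 30-day retention window:
from datetime import datetime, timedelta, timezone

from langgraph_sdk import get_client

client = get_client(url=<DEPLOYMENT_URL>)

# Arbitrary example retention window; tune to your data requirements.
cutoff = datetime.now(timezone.utc) - timedelta(days=30)

threads = await client.threads.search(limit=100)
for thread in threads:
    # Assumes the thread payload includes an ISO-8601 "updated_at" field.
    updated_at = datetime.fromisoformat(thread["updated_at"].replace("Z", "+00:00"))
    if updated_at < cutoff:
        # Deleting a thread also deletes its runs and checkpoints.
        await client.threads.delete(thread["thread_id"])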

Avoid polling and use /join to monitor the state of a run

Avoid polling the state of a run; use the /join API endpoint instead. This endpoint returns the final state of the run once it is complete. If you need to monitor the output of a run in real time, use the /stream API endpoint, which streams the run output, including the final state.
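For example, with the Python SDK (a minimal sketch reusing the "agent" assistant ID from the earlier examples; the message input is illustrative):
from langgraph_sdk import get_client

client = get_client(url=<DEPLOYMENT_URL>)
thread = await client.threads.create()
run = await client.runs.create(
    thread_id=thread["thread_id"],
    assistant_id="agent",
)

# Block until the run completes and return its final state, instead of
# polling the run status in a loop.
final_state = await client.runs.join(thread["thread_id"], run["run_id"])

# If you need output in real time, create and stream a run instead:
async for chunk in client.runs.stream(
    thread["thread_id"],
    assistant_id="agent",
    input={"messages": [{"role": "user", "content": "hello"}]},  # illustrative input
    stream_mode="values",
):
    print(chunk.event, chunk.data)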

Self-hosted

These settings are only required for self-hosted deployments. By default, cloud deployments already have these best practices enabled.
Configure autoscaling for bursty workloads
Autoscaling is disabled by default, but should be configured for bursty workloads. You can determine the maximum number of API servers you should allow the autoscaler to scale to based on maximum expected throughput. The default for cloud deployments is a maximum of 10 API servers.

Example self-hosted Agent Server configurations

The exact optimal configuration depends on your application complexity, request patterns, and data requirements. Use the following examples in combination with the information in the previous sections and your specific usage to update your deployment configuration as needed. If you have any questions, reach out to the LangChain team at support@langchain.dev.
The following table provides an overview comparing different LangSmith Agent Server configurations for various load patterns (read requests per second / write requests per second) and standard assistant characteristics (average run execution time of 1 second, moderate CPU and memory usage):
| | Low / low | Low / high | High / low | Medium / medium | High / high |
| --- | --- | --- | --- | --- | --- |
| Read requests per second | 5 | 5 | 500 | 50 | 500 |
| Write requests per second | 5 | 500 | 5 | 50 | 500 |
| API servers (1 CPU, 2Gi per server) | 1 (default) | 6 | 10 | 3 | 15 |
| Queue workers (1 CPU, 2Gi per worker) | 1 (default) | 10 | 1 (default) | 5 | 10 |
| N_JOBS_PER_WORKER | 10 (default) | 50 | 10 | 10 | 50 |
| Redis resources | 2 Gi (default) | 2 Gi (default) | 2 Gi (default) | 2 Gi (default) | 2 Gi (default) |
| Postgres resources | 2 CPU, 8 Gi (default) | 4 CPU, 16 Gi | 4 CPU, 16 Gi | 4 CPU, 16 Gi | 8 CPU, 32 Gi |
The following sample configurations enable each of these setups. Load levels are defined as:
  • Low means approximately 5 requests per second
  • Medium means approximately 50 requests per second
  • High means approximately 500 requests per second

Low reads, low writes

The default LangSmith Deployment configuration will handle this load. No custom resource configuration is needed here.

Low reads, high writes

Your deployment processes a high volume of write requests (500 per second) but relatively few read requests (5 per second). For this load pattern, we recommend a configuration like the following:
# Example configuration for low reads, high writes (5 read/500 write requests per second)
api:
  replicas: 6
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"

queue:
  replicas: 10
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"

config:
  numberOfJobsPerWorker: 50

redis:
  resources:
    requests:
      memory: "2Gi"
    limits:
      memory: "2Gi"

postgres:
  resources:
    requests:
      cpu: "4"
      memory: "16Gi"
    limits:
      cpu: "8"
      memory: "32Gi"

High reads, low writes

Your deployment processes a high volume of read requests (500 per second) but relatively few write requests (5 per second). For this load pattern, we recommend a configuration like the following:
# Example configuration for high reads, low writes (500 read/5 write requests per second)
api:
  replicas: 10
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"

queue:
  replicas: 1  # Default, minimal write load
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"

redis:
  resources:
    requests:
      memory: "2Gi"
    limits:
      memory: "2Gi"

postgres:
  resources:
    requests:
      cpu: "4"
      memory: "16Gi"
    limits:
      cpu: "8"
      memory: "32Gi"
  # Consider read replicas for high read scenarios
  readReplicas: 2

Medium reads, medium writes

This balanced load pattern has moderate read and write volumes (50 read/50 write requests per second). For this load pattern, we recommend a configuration like the following:
# Example configuration for medium reads, medium writes (50 read/50 write requests per second)
api:
  replicas: 3
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"

queue:
  replicas: 5
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"

redis:
  resources:
    requests:
      memory: "2Gi"
    limits:
      memory: "2Gi"

postgres:
  resources:
    requests:
      cpu: "4"
      memory: "16Gi"
    limits:
      cpu: "8"
      memory: "32Gi"

High reads, high writes

Your deployment processes high volumes of both read and write requests (500 read/500 write requests per second). For this load pattern, we recommend a configuration like the following:
# Example configuration for high reads, high writes (500 read/500 write requests per second)
api:
  replicas: 15
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"

queue:
  replicas: 10
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"

config:
  numberOfJobsPerWorker: 50

redis:
  resources:
    requests:
      memory: "2Gi"
    limits:
      memory: "2Gi"

postgres:
  resources:
    requests:
      cpu: "8"
      memory: "32Gi"
    limits:
      cpu: "16"
      memory: "64Gi"

Autoscaling

If your deployment experiences bursty traffic, you can enable autoscaling to scale the number of API servers and queue workers to handle the load. Here is a sample configuration for autoscaling for high reads and high writes:
api:
  autoscaling:
    enabled: true
    minReplicas: 15
    maxReplicas: 25

queue:
  autoscaling:
    enabled: true
    minReplicas: 10
    maxReplicas: 20
Ensure that your deployment environment has sufficient resources to scale to the recommended size, and implement monitoring and alerting to track resource usage and application performance.
