The default configuration for LangSmith Agent Server is designed to handle substantial read and write load across a variety of workloads. By following the best practices outlined below, you can tune your Agent Server to perform optimally for your specific workload. This page describes scaling considerations for the Agent Server and provides examples to help configure your deployment. For example self-hosted configurations, refer to the Example self-hosted Agent Server configurations section.

Scaling for write load

Write load is primarily driven by the following factors:
  • Creation of new runs
  • Creation of new checkpoints during run execution
  • Writing to long-term memory
  • Creation of new threads
  • Creation of new assistants
  • Deletion of runs, checkpoints, threads, assistants and cron jobs
The following components are primarily responsible for handling write load:
  • API server: Handles initial request and persistence of data to the database.
  • Queue worker: Handles the execution of runs.
  • Redis: Handles the storage of ephemeral data about ongoing runs.
  • Postgres: Handles the storage of all data, including runs, threads, assistants, cron jobs, checkpoints, and long-term memory.

Best practices for scaling the write path

Change N_JOBS_PER_WORKER based on assistant characteristics

The default value of N_JOBS_PER_WORKER is 10. You can change this value to scale the maximum number of runs that can be executed at a time by a single queue worker based on the characteristics of your assistant. Some general guidelines for changing N_JOBS_PER_WORKER:
  • If your assistant is CPU-bound, the default value of 10 is likely sufficient. You might lower N_JOBS_PER_WORKER if you notice excessive CPU usage on queue workers or delays in run execution.
  • If your assistant is I/O-bound, increase N_JOBS_PER_WORKER to handle more concurrent runs per worker.
There is no upper limit to N_JOBS_PER_WORKER. However, queue workers are greedy when fetching new runs, which means they will try to pick up as many runs as they have available jobs and begin executing them immediately. Setting N_JOBS_PER_WORKER too high in environments with bursty traffic can lead to uneven worker utilization and increased run execution times.
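As a rough illustration (the run time and I/O-wait fraction below are assumed example values, not measured ones, and this heuristic is not an official sizing formula), you can estimate an upper bound for N_JOBS_PER_WORKER from how much of a run's wall-clock time is spent waiting on I/O:
# Illustrative heuristic only: estimate how many runs one CPU can interleave.
# Assumes each run takes ~1 s of wall-clock time and spends 90% of it waiting
# on I/O (e.g. LLM API calls); measure these numbers for your own assistant.
run_time_s = 1.0
io_wait_fraction = 0.9
cpu_time_per_run_s = run_time_s * (1 - io_wait_fraction)  # 0.1 s of CPU work per run

# With 1 CPU per queue worker, roughly this many runs can be in flight before
# the CPU becomes the bottleneck:
rough_upper_bound = run_time_s / cpu_time_per_run_s  # ~10
print(rough_upper_bound)
A mostly I/O-bound assistant (for example, 98% of run time spent waiting on model APIs) would support a proportionally higher value, while a CPU-heavy assistant would support a lower one.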

Avoid synchronous blocking operations

Avoid synchronous blocking operations in your code and prefer asynchronous operations. Long synchronous operations can block the main event loop, causing longer request and run execution times and potential timeouts. For example, consider an application that needs to sleep for 1 second. Instead of using synchronous code like this:
import time

def my_function():
    time.sleep(1)
Prefer asynchronous code like this:
import asyncio

async def my_function():
    await asyncio.sleep(1)
If an assistant requires synchronous blocking operations, set BG_JOB_ISOLATED_LOOPS to True to execute each run in a separate event loop.
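If only part of your assistant is blocking (for example, a library that exposes no async API) and you do not want to isolate the whole run, another option is to offload that call to a worker thread so the event loop stays responsive. A minimal sketch, where slow_blocking_lookup is a hypothetical placeholder for such a call:
import asyncio
import time

def slow_blocking_lookup(query: str) -> str:
    # Hypothetical stand-in for a library call that only has a blocking API.
    time.sleep(1)
    return f"result for {query}"

async def my_node(state: dict) -> dict:
    # Offload the blocking call to a worker thread so the event loop can keep
    # serving other runs on this queue worker.
    result = await asyncio.to_thread(slow_blocking_lookup, state["query"])
    return {"result": result}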

Minimize redundant checkpointing

Minimize redundant checkpointing by setting durability to the minimum level necessary to ensure your data is durable. The default durability mode is "async", meaning checkpoints are written asynchronously after each step. If an assistant needs to persist only the final state of the run, set durability to "exit". This can be set when creating the run:
from langgraph_sdk import get_client

client = get_client(url=<DEPLOYMENT_URL>)
thread = await client.threads.create()
run = await client.runs.create(
    thread_id=thread["thread_id"],
    assistant_id="agent",
    durability="exit"
)

Self-hosted

These settings are only required for self-hosted deployments. By default, cloud deployments already have these best practices enabled.
Enable the use of queue workers
By default, the API server manages the queue and does not use queue workers. You can enable the use of queue workers by setting the queue.enabled configuration to true.
queue:
  enabled: true
This will allow the API server to offload the queue management to the queue workers, significantly reducing the load on the API server and allowing it to focus on handling requests.
Support a number of jobs equal to expected throughput
The more runs you execute in parallel, the more jobs you will need to handle the load. There are two main parameters to scale the available jobs:
  • number_of_queue_workers: The number of queue workers provisioned.
  • N_JOBS_PER_WORKER: The number of runs that a single queue worker can execute at a time. Defaults to 10.
You can calculate the available jobs with the following equation:
available_jobs = number_of_queue_workers * N_JOBS_PER_WORKER
Throughput is then the number of runs that can be executed per second by the available jobs:
throughput_per_second = available_jobs / average_run_execution_time_seconds
Therefore, the minimum number of queue workers you should provision to support your expected steady state throughput is:
number_of_queue_workers = throughput_per_second * average_run_execution_time_seconds / N_JOBS_PER_WORKER
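For example, using the high-write figures from the table below (500 runs per second, an average run execution time of 1 second, and N_JOBS_PER_WORKER set to 50), the minimum worker count works out to 10:
# Worked example: sustain 500 new runs per second with an average run
# execution time of 1 second and N_JOBS_PER_WORKER set to 50.
throughput_per_second = 500
average_run_execution_time_seconds = 1.0
n_jobs_per_worker = 50

# Runs that must be executing concurrently at steady state:
available_jobs_needed = throughput_per_second * average_run_execution_time_seconds  # 500

# Minimum number of queue workers to provision:
number_of_queue_workers = available_jobs_needed / n_jobs_per_worker  # 10.0
print(number_of_queue_workers)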
Configure autoscaling for bursty workloads
Autoscaling is disabled by default, but should be configured for bursty workloads. Using the same calculations as the previous section, you can determine the maximum number of queue workers you should allow the autoscaler to scale to based on maximum expected throughput.

Scaling for read load

Read load is primarily driven by requests that retrieve existing data, such as fetching and searching threads, runs, assistants, and checkpoints, and polling the state of in-progress runs. The following components are primarily responsible for handling read load:
  • API server: Handles the request and direct retrieval of data from the database.
  • Postgres: Handles the storage of all data, including runs, threads, assistants, cron jobs, checkpoints, and long-term memory.
  • Redis: Handles the storage of ephemeral data about ongoing runs, including streaming messages from queue workers to API servers.

Best practices for scaling the read path

Use filtering to reduce the number of resources returned per request

Agent Server provides a search API for each resource type. These APIs implement pagination by default and offer many filtering options. Use filtering to reduce the number of resources returned per request and improve performance.
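For example, with the Python SDK you can page through threads and filter by metadata rather than listing everything in one request (the user_id metadata key below is a hypothetical example):
from langgraph_sdk import get_client

client = get_client(url=<DEPLOYMENT_URL>)

# Fetch only the threads for one user, 20 at a time, instead of
# retrieving every thread in a single request.
threads = await client.threads.search(
    metadata={"user_id": "user-123"},  # hypothetical metadata filter
    limit=20,
    offset=0,
)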

Set a TTL to automatically delete old data

Set a TTL on threads to automatically clean up old data. Runs and checkpoints are automatically deleted when the associated thread is deleted.
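If you also need to clean up data that predates your TTL configuration, a one-off sweep with the SDK is one option. A rough sketch, assuming thread payloads expose an ISO-8601 updated_at timestamp and using an arbitrary 30-day retention window:
from datetime import datetime, timedelta, timezone

from langgraph_sdk import get_client

client = get_client(url=<DEPLOYMENT_URL>)

# Arbitrary example retention window; tune to your data requirements.
cutoff = datetime.now(timezone.utc) - timedelta(days=30)

threads = await client.threads.search(limit=100)
for thread in threads:
    # Assumes the thread payload includes an ISO-8601 "updated_at" field.
    updated_at = datetime.fromisoformat(thread["updated_at"].replace("Z", "+00:00"))
    if updated_at < cutoff:
        # Deleting a thread also deletes its runs and checkpoints.
        await client.threads.delete(thread["thread_id"])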

Avoid polling and use /join to monitor the state of a run

Avoid polling the state of a run; use the /join API endpoint instead. This endpoint returns the final state of the run once it is complete. If you need to monitor the output of a run in real time, use the /stream API endpoint, which streams the run output, including the final state.
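For example, with the Python SDK (a minimal sketch reusing the "agent" assistant ID from the earlier examples; the message input is illustrative):
from langgraph_sdk import get_client

client = get_client(url=<DEPLOYMENT_URL>)
thread = await client.threads.create()
run = await client.runs.create(
    thread_id=thread["thread_id"],
    assistant_id="agent",
)

# Block until the run completes and return its final state, instead of
# polling the run status in a loop.
final_state = await client.runs.join(thread["thread_id"], run["run_id"])

# If you need output in real time, create and stream a run instead:
async for chunk in client.runs.stream(
    thread["thread_id"],
    assistant_id="agent",
    input={"messages": [{"role": "user", "content": "hello"}]},  # illustrative input
    stream_mode="values",
):
    print(chunk.event, chunk.data)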

Self-hosted

These settings are only required for self-hosted deployments. By default, cloud deployments already have these best practices enabled.
Configure autoscaling for bursty workloads
Autoscaling is disabled by default, but should be configured for bursty workloads. You can determine the maximum number of API servers you should allow the autoscaler to scale to based on maximum expected throughput. The default for cloud deployments is a maximum of 10 API servers.

Example self-hosted Agent Server configurations

The exact optimal configuration depends on your application complexity, request patterns, and data requirements. Use the following examples in combination with the information in the previous sections and your specific usage to update your deployment configuration as needed. If you have any questions, reach out to the LangChain team at support@langchain.dev.
The following table provides an overview comparing different LangSmith Agent Server configurations for various load patterns (read requests per second / write requests per second) and standard assistant characteristics (average run execution time of 1 second, moderate CPU and memory usage):
| | Low / low | Low / high | High / low | Medium / medium | High / high |
| --- | --- | --- | --- | --- | --- |
| Read requests per second | 5 | 5 | 500 | 50 | 500 |
| Write requests per second | 5 | 500 | 5 | 50 | 500 |
| API servers (1 CPU, 2Gi per server) | 1 (default) | 6 | 10 | 3 | 15 |
| Queue workers (1 CPU, 2Gi per worker) | 1 (default) | 10 | 1 (default) | 5 | 10 |
| N_JOBS_PER_WORKER | 10 (default) | 50 | 10 | 10 | 50 |
| Redis resources | 2 Gi (default) | 2 Gi (default) | 2 Gi (default) | 2 Gi (default) | 2 Gi (default) |
| Postgres resources | 2 CPU, 8 Gi (default) | 4 CPU, 16 Gi | 4 CPU, 16 Gi | 4 CPU, 16 Gi | 8 CPU, 32 Gi |
The following sample configurations enable each of these setups. Load levels are defined as:
  • Low means approximately 5 requests per second
  • Medium means approximately 50 requests per second
  • High means approximately 500 requests per second

Low reads, low writes

The default LangSmith Deployment configuration will handle this load. No custom resource configuration is needed here.

Low reads, high writes

Your deployment processes a high volume of write requests (500 per second) but relatively few read requests (5 per second). For this load pattern, we recommend a configuration like the following:
# Example configuration for low reads, high writes (5 read/500 write requests per second)
api:
  replicas: 6
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"

queue:
  replicas: 10
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"

config:
  numberOfJobsPerWorker: 50

redis:
  resources:
    requests:
      memory: "2Gi"
    limits:
      memory: "2Gi"

postgres:
  resources:
    requests:
      cpu: "4"
      memory: "16Gi"
    limits:
      cpu: "8"
      memory: "32Gi"

High reads, low writes

Your deployment processes a high volume of read requests (500 per second) but relatively few write requests (5 per second). For this load pattern, we recommend a configuration like the following:
# Example configuration for high reads, low writes (500 read/5 write requests per second)
api:
  replicas: 10
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"

queue:
  replicas: 1  # Default, minimal write load
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"

redis:
  resources:
    requests:
      memory: "2Gi"
    limits:
      memory: "2Gi"

postgres:
  resources:
    requests:
      cpu: "4"
      memory: "16Gi"
    limits:
      cpu: "8"
      memory: "32Gi"
  # Consider read replicas for high read scenarios
  readReplicas: 2

Medium reads, medium writes

This balanced load pattern has moderate read and write volumes (50 read/50 write requests per second). For this load pattern, we recommend a configuration like the following:
# Example configuration for medium reads, medium writes (50 read/50 write requests per second)
api:
  replicas: 3
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"

queue:
  replicas: 5
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"

redis:
  resources:
    requests:
      memory: "2Gi"
    limits:
      memory: "2Gi"

postgres:
  resources:
    requests:
      cpu: "4"
      memory: "16Gi"
    limits:
      cpu: "8"
      memory: "32Gi"

High reads, high writes

Your deployment processes high volumes of both read and write requests (500 read/500 write requests per second). For this load pattern, we recommend a configuration like the following:
# Example configuration for high reads, high writes (500 read/500 write requests per second)
api:
  replicas: 15
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"

queue:
  replicas: 10
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"

config:
  numberOfJobsPerWorker: 50

redis:
  resources:
    requests:
      memory: "2Gi"
    limits:
      memory: "2Gi"

postgres:
  resources:
    requests:
      cpu: "8"
      memory: "32Gi"
    limits:
      cpu: "16"
      memory: "64Gi"

Autoscaling

If your deployment experiences bursty traffic, you can enable autoscaling to scale the number of API servers and queue workers to handle the load. Here is a sample configuration for autoscaling for high reads and high writes:
api:
  autoscaling:
    enabled: true
    minReplicas: 15
    maxReplicas: 25

queue:
  autoscaling:
    enabled: true
    minReplicas: 10
    maxReplicas: 20
Ensure that your deployment environment has sufficient resources to scale to the recommended size, and implement monitoring and alerting to track resource usage and application performance.
