Scaling for write load
Write load is primarily driven by the following factors:
- Creation of new runs
- Creation of new checkpoints during run execution
- Writing to long-term memory
- Creation of new threads
- Creation of new assistants
- Deletion of runs, checkpoints, threads, assistants, and cron jobs
These operations are handled by the following components:
- API server: Handles the initial request and persists data to the database.
- Queue worker: Handles the execution of runs.
- Redis: Handles the storage of ephemeral data about ongoing runs.
- Postgres: Handles the storage of all data, including runs, threads, assistants, cron jobs, checkpoints, and long-term memory.
Best practices for scaling the write path
Change `N_JOBS_PER_WORKER` based on assistant characteristics
The default value of `N_JOBS_PER_WORKER` is 10. You can change this value to scale the maximum number of runs a single queue worker can execute at a time, based on the characteristics of your assistant.
Some general guidelines for changing `N_JOBS_PER_WORKER`:
- If your assistant is CPU bound, the default value of 10 is likely sufficient. You might lower `N_JOBS_PER_WORKER` if you notice excessive CPU usage on queue workers or delays in run execution.
- If your assistant is I/O bound, increase `N_JOBS_PER_WORKER` to handle more concurrent runs per worker (see the sketch after this list).
There is no hard limit on the value of `N_JOBS_PER_WORKER`. However, queue workers are greedy when fetching new runs: they pick up as many runs as they have available jobs and begin executing them immediately. Setting `N_JOBS_PER_WORKER` too high in environments with bursty traffic can therefore lead to uneven worker utilization and increased run execution times.
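For example, a minimal container-spec sketch for raising the limit on an I/O-bound deployment (the manifest structure is illustrative; only the `N_JOBS_PER_WORKER` environment variable comes from the text above):

```yaml
# Illustrative queue-worker container spec; only the env var name is prescribed.
containers:
  - name: queue-worker
    env:
      - name: N_JOBS_PER_WORKER
        value: "20"  # default is 10; raised for an I/O-bound assistant
```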
Avoid synchronous blocking operations
Avoid synchronous blocking operations in your code and prefer asynchronous operations. Long synchronous operations can block the main event loop, causing longer request and run execution times and potential timeouts. For example, consider an application that needs to sleep for 1 second: instead of calling a synchronous sleep, await an asynchronous one, as in the sketch below. If blocking operations are unavoidable, set `BG_JOB_ISOLATED_LOOPS` to `True` to execute each run in a separate event loop.
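A minimal sketch of the two approaches (the node functions are illustrative):

```python
import asyncio
import time


# Blocking: stalls the worker's event loop, delaying every other run it hosts.
def slow_node(state: dict) -> dict:
    time.sleep(1)
    return state


# Non-blocking: yields control while sleeping so other runs keep progressing.
async def fast_node(state: dict) -> dict:
    await asyncio.sleep(1)
    return state
```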
Minimize redundant checkpointing
Minimize redundant checkpointing by setting `durability` to the minimum level necessary to ensure your data is durable.
The default durability mode is `"async"`, meaning checkpoints are written asynchronously after each step. If an assistant needs to persist only the final state of the run, `durability` can be set to `"exit"`, storing only the final state of the run. This can be set when creating the run, as in the sketch below:
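A sketch using the Python SDK (the URL and input are placeholders; passing `durability` at run creation follows from the text above):

```python
from langgraph_sdk import get_client

client = get_client(url="http://localhost:2024")  # placeholder URL


async def start_run(thread_id: str, assistant_id: str):
    # Only the final state is checkpointed, skipping per-step writes.
    return await client.runs.create(
        thread_id,
        assistant_id,
        input={"messages": []},  # placeholder input
        durability="exit",
    )
```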
Self-hosted
These settings are only required for self-hosted deployments. By default, cloud deployments already have these best practices enabled.
Enable the use of queue workers
By default, the API server manages the queue and does not use queue workers. You can enable the use of queue workers by setting the `queue.enabled` configuration to `true`, as in the sketch below.
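A minimal values-file sketch (the surrounding structure is hypothetical; only the `queue.enabled` key is named above):

```yaml
# Hypothetical values file; enables dedicated queue workers.
queue:
  enabled: true
```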
Support a number of jobs equal to expected throughput
The more runs you execute in parallel, the more jobs you need to handle the load. There are two main parameters for scaling the available jobs:
- `number_of_queue_workers`: The number of queue workers provisioned.
- `N_JOBS_PER_WORKER`: The number of runs a single queue worker can execute at a time. Defaults to 10.
The total number of available jobs is the product of these two parameters; a sizing sketch follows below.
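A back-of-the-envelope sizing sketch (the peak concurrency figure is illustrative):

```python
import math

expected_concurrent_runs = 200  # illustrative peak load
n_jobs_per_worker = 10          # default N_JOBS_PER_WORKER

# Provision enough workers that total jobs cover the expected peak.
queue_workers = math.ceil(expected_concurrent_runs / n_jobs_per_worker)
print(queue_workers)  # -> 20
```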
Configure autoscaling for bursty workloads
Autoscaling is disabled by default but should be configured for bursty workloads. Using the same calculations as in the previous section, you can determine the maximum number of queue workers the autoscaler should be allowed to scale to, based on your maximum expected throughput.
Scaling for read load
Read load is primarily driven by the following factors:
- Getting the results of a run
- Getting the state of a thread
- Searching for runs, threads, cron jobs, and assistants
- Retrieving checkpoints and long-term memory
These requests are handled by the following components:
- API server: Handles the request and retrieves data directly from the database.
- Postgres: Handles the storage of all data, including runs, threads, assistants, cron jobs, checkpoints, and long-term memory.
- Redis: Handles the storage of ephemeral data about ongoing runs, including streaming messages from queue workers to API servers.
Best practices for scaling the read path
Use filtering to reduce the number of resources returned per request
Agent Server provides a search API for each resource type. These APIs paginate results by default and offer many filtering options. Use filtering to reduce the number of resources returned per request and improve performance, as in the sketch below.
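A sketch using the Python SDK (the URL, metadata filter, and page size are placeholders):

```python
from langgraph_sdk import get_client

client = get_client(url="http://localhost:2024")  # placeholder URL


async def find_threads():
    # Filter server-side and paginate, rather than fetching every thread.
    return await client.threads.search(
        metadata={"user_id": "user-123"},  # placeholder filter
        limit=20,
        offset=0,
    )
```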
Set TTLs to automatically delete old data
Set a TTL on threads to automatically clean up old data. Runs and checkpoints are deleted automatically when the associated thread is deleted.
Avoid polling and use /join to monitor the state of a run
Avoid polling the state of a run by using the `/join` API endpoint instead. It returns the final state of the run once the run is complete.
If you need to monitor the output of a run in real time, use the `/stream` API endpoint. This streams the run output, including the final state of the run. Both approaches are sketched below.
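A sketch using the Python SDK (the URL, IDs, and input are placeholders; the helpers correspond to the `/join` and `/stream` endpoints):

```python
from langgraph_sdk import get_client

client = get_client(url="http://localhost:2024")  # placeholder URL


async def wait_for_result(thread_id: str, run_id: str):
    # Blocks until the run finishes and returns its final state,
    # replacing a client-side polling loop.
    return await client.runs.join(thread_id, run_id)


async def watch_run(thread_id: str, assistant_id: str):
    # Streams output as the run executes, ending with the final state.
    async for chunk in client.runs.stream(
        thread_id,
        assistant_id,
        input={"messages": []},  # placeholder input
    ):
        print(chunk.event, chunk.data)
```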
Self-hosted
These settings are only required for self-hosted deployments. By default, cloud deployments already have these best practices enabled.
Configure autoscaling for bursty workloads
Autoscaling is disabled by default but should be configured for bursty workloads. You can determine the maximum number of API servers the autoscaler should be allowed to scale to, based on your maximum expected throughput. The default for cloud deployments is a maximum of 10 API servers.
Example self-hosted Agent Server configurations
The exact optimal configuration depends on your application complexity, request patterns, and data requirements. Use the following examples in combination with the information in the previous sections and your specific usage to update your deployment configuration as needed. If you have any questions, reach out to the LangChain team at support@langchain.dev.
| Reads / writes | Low / low | Low / high | High / low | Medium / medium | High / high |
|---|---|---|---|---|---|
| Read requests per second | 5 | 5 | 500 | 50 | 500 |
| Write requests per second | 5 | 500 | 5 | 50 | 500 |
| API servers (1 CPU, 2 Gi per server) | 1 (default) | 6 | 10 | 3 | 15 |
| Queue workers (1 CPU, 2 Gi per worker) | 1 (default) | 10 | 1 (default) | 5 | 10 |
| `N_JOBS_PER_WORKER` | 10 (default) | 50 | 10 | 10 | 50 |
| Redis resources | 2 Gi (default) | 2 Gi (default) | 2 Gi (default) | 2 Gi (default) | 2 Gi (default) |
| Postgres resources | 2 CPU, 8 Gi (default) | 4 CPU, 16 Gi | 4 CPU, 16 Gi | 4 CPU, 16 Gi | 8 CPU, 32 Gi |
- Low means approximately 5 requests per second
- Medium means approximately 50 requests per second
- High means approximately 500 requests per second
Low reads, low writes
The default LangSmith Deployment configuration will handle this load. No custom resource configuration is needed here.
Low reads, high writes
You have a high volume of write requests (500 per second) being processed by your deployment, but relatively few read requests (5 per second). For this, we recommend a configuration like the sketch below.
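A values-file sketch with hypothetical key names; the replica counts and resources come from the Low / high column of the table above, so map them onto your chart's actual schema:

```yaml
# Hypothetical keys; values from the "Low / high" column.
apiServer:
  replicas: 6
  resources:
    limits: {cpu: "1", memory: "2Gi"}
queue:
  enabled: true
  replicas: 10
  env:
    - name: N_JOBS_PER_WORKER
      value: "50"
postgres:
  resources:
    limits: {cpu: "4", memory: "16Gi"}
```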
High reads, low writes
You have a high volume of read requests (500 per second) but relatively few write requests (5 per second). For this, we recommend a configuration like the sketch below.
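Again with hypothetical keys, using the High / low column; queue workers and `N_JOBS_PER_WORKER` stay at their defaults:

```yaml
# Hypothetical keys; values from the "High / low" column.
apiServer:
  replicas: 10  # reads are served by API servers
postgres:
  resources:
    limits: {cpu: "4", memory: "16Gi"}
```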
Medium reads, medium writes
This is a balanced configuration that should handle moderate read and write loads (50 read / 50 write requests per second). For this, we recommend a configuration like the sketch below.
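Again with hypothetical keys, using the Medium / medium column; `N_JOBS_PER_WORKER` stays at the default of 10:

```yaml
# Hypothetical keys; values from the "Medium / medium" column.
apiServer:
  replicas: 3
queue:
  enabled: true
  replicas: 5
postgres:
  resources:
    limits: {cpu: "4", memory: "16Gi"}
```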
High reads, high writes
You have high volumes of both read and write requests (500 read / 500 write requests per second). For this, we recommend a configuration like the sketch below.
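Again with hypothetical keys, using the High / high column:

```yaml
# Hypothetical keys; values from the "High / high" column.
apiServer:
  replicas: 15
queue:
  enabled: true
  replicas: 10
  env:
    - name: N_JOBS_PER_WORKER
      value: "50"
postgres:
  resources:
    limits: {cpu: "8", memory: "32Gi"}
```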
Autoscaling
If your deployment experiences bursty traffic, you can enable autoscaling to scale the number of API servers and queue workers with the load. A sample autoscaling configuration for high reads and high writes is sketched below. Ensure that your deployment environment has sufficient resources to scale to the recommended size, and monitor your applications and infrastructure to ensure optimal performance. Consider implementing monitoring and alerting to track resource usage and application performance.
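A sketch with hypothetical keys; the maximums match the High / high column above, while the minimums are illustrative floors:

```yaml
# Hypothetical autoscaling stanza; bounds follow the high reads / high writes scenario.
apiServer:
  autoscaling:
    enabled: true
    minReplicas: 3   # illustrative floor
    maxReplicas: 15
queue:
  autoscaling:
    enabled: true
    minReplicas: 2   # illustrative floor
    maxReplicas: 10
```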