This page describes how to plan, configure, and operate disaster recovery (DR) for self-hosted LangSmith Observability and Evaluation. It covers what data must be protected, where it lives, how to back it up, and how to recover the platform after a regional or zonal failure.
Shared responsibility. For self-hosted deployments you are responsible for backups, replication, restore testing, and recovery procedures for every component, including LangSmith pods and all backing data stores. LangChain is responsible only for the LangSmith software itself. For the equivalent SaaS responsibilities, see the Shared responsibility model.
What you are recovering
Self-hosted LangSmith is composed of stateless services backed by four state stores. Recovery planning is almost entirely about the state stores. You can recreate the stateless services at any time by reapplying the Helm chart.

| Layer | Components | State | Recovery action |
|---|---|---|---|
| LangSmith services | langsmith-frontend, langsmith-backend, langsmith-platform-backend, langsmith-queue, langsmith-ingest-queue, langsmith-playground, langsmith-ace-backend | Stateless | Reinstall the Helm chart |
| PostgreSQL | Operational data: orgs, workspaces, users, API keys, datasets, prompts, projects, deployments metadata | Durable | Restore from backup or replica |
| ClickHouse | Traces and feedback (high volume analytical data) | Durable | Restore from backup or replica |
| Blob storage (S3/GCS/Azure Blob) | Run inputs, outputs, errors, manifests, extras, events, attachments (when enabled) | Durable | Restore from versioned bucket or replica |
| Redis (or Valkey) | Ephemeral queue state, pub/sub, cache, run heartbeats | Ephemeral | Reprovision; no restore required |
| Kubernetes objects | Helm values, Secrets, TLS material, IRSA / Workload Identity bindings | Configuration | Re-apply from source control or back up cluster state |
Plan your RPO and RTO
Before designing your DR architecture, define two targets:
- Recovery Point Objective (RPO): the maximum amount of data loss your organization can tolerate, measured in time. With managed Postgres point-in-time recovery (PITR), RPO is typically less than 5 minutes. With nightly snapshots only, RPO can be up to 24 hours.
- Recovery Time Objective (RTO): the maximum time you can take to restore service after a failure. A warm cross-region replica can deliver RTO in minutes; a cold restore from snapshot can take hours, especially for large ClickHouse datasets.
| Profile | Typical RPO | Typical RTO | Approach |
|---|---|---|---|
| Snapshot-only | 6 to 24 hours | Hours | Daily managed backups of each store. Lowest cost, longest restore. |
| Multi-AZ HA | Seconds | Minutes (zone failure) | Synchronous standby in another AZ for Postgres and ClickHouse, Multi-AZ Redis, zone-redundant blob storage. Standard production posture. |
| Cross-region DR | Minutes to hours | Hours | Backups of Postgres, ClickHouse, and blob storage copied to a second region, restored on demand. Optionally a Postgres cross-region replica for a tighter Postgres RPO. Highest cost, slower recovery than Multi-AZ, but protects against a regional outage. |
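One way to make the choice of profile concrete is to encode your targets and filter the profiles against them. The following is a minimal sketch, not part of LangSmith. The `DrProfile` class, the `profiles_meeting` function, and every number in `PROFILES` are illustrative assumptions loosely based on the table above; replace them with figures measured in your own environment.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass(frozen=True)
class DrProfile:
    name: str
    worst_case_rpo: timedelta     # assumed worst-case data loss
    worst_case_rto: timedelta     # assumed worst-case time to restore service
    survives_region_loss: bool    # whether the profile covers a regional outage


# Placeholder numbers for illustration only; they are assumptions, not guarantees.
PROFILES = [
    DrProfile("snapshot-only", timedelta(hours=24), timedelta(hours=8), False),
    DrProfile("multi-az-ha", timedelta(seconds=30), timedelta(minutes=15), False),
    DrProfile("cross-region-dr", timedelta(hours=1), timedelta(hours=6), True),
]


def profiles_meeting(
    rpo_target: timedelta,
    rto_target: timedelta,
    need_region_protection: bool,
) -> List[str]:
    """Return the names of profiles whose assumed worst case fits the targets."""
    return [
        p.name
        for p in PROFILES
        if p.worst_case_rpo <= rpo_target
        and p.worst_case_rto <= rto_target
        and (p.survives_region_loss or not need_region_protection)
    ]


if __name__ == "__main__":
    # Example: tolerate 5 minutes of data loss and 1 hour of downtime,
    # with no requirement to survive a regional outage.
    print(profiles_meeting(timedelta(minutes=5), timedelta(hours=1), False))
```

If the result is empty, either the targets are tighter than any profile can deliver or the assumed figures need to be revisited after a restore drill.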
Postgres
LangSmith uses PostgreSQL as the primary store for operational and transactional data. All communication with Postgres is retried on retryable errors, so a brief outage during failover usually does not surface as user-visible errors. A prolonged outage makes the LangSmith API unavailable.
Use a managed service
We strongly recommend running Postgres on a managed service in production. Managed services provide built-in automated backups, PITR, and HA failover. For setup, refer to Connect external Postgres.
AWS
Run Amazon RDS for PostgreSQL or Aurora PostgreSQL in Multi-AZ mode.
- Backups: Enable automated backups with a retention window that matches your compliance posture (7 to 35 days is typical).
- PITR: Automated backups include PITR within the retention window.
- HA: Multi-AZ deployments maintain a synchronous standby in a second availability zone with automatic failover.
- Cross-region DR: For Aurora, configure an Aurora Global Database. For RDS, use cross-region read replicas or copy automated snapshots to a secondary region.
- Encryption: Enable storage encryption with a customer-managed KMS key.
In-cluster Postgres
If you must run Postgres in-cluster from the bundled chart, you are responsible for backing up the underlying PersistentVolume. Snapshot the PVC on a regular cadence using your CSI driver’s snapshot class, and copy snapshots to object storage or a different region. This path is not recommended for production.
ClickHouse
ClickHouse holds the high-volume trace and feedback data and is typically the largest data store in a LangSmith deployment. Backups and replication need to be planned for cost and restore-time impact.
Managed ClickHouse
The fastest path to a resilient ClickHouse is a managed option. See Connect external ClickHouse.
- LangSmith Managed ClickHouse: LangChain operates the ClickHouse cluster, including backups and replication. VPC peering connects it to your LangSmith installation.
- ClickHouse Cloud: Provides built-in backups, replication, and HA. Available on AWS, GCP, and Azure marketplaces.
Self-managed replicated cluster
If you self-manage ClickHouse for compliance or air-gap reasons, use a replicated cluster. A single-node ClickHouse instance cannot meet a meaningful RPO.
- Configure a multi-node ClickHouse cluster with replication via Keeper or ZooKeeper.
- Set the `cluster` value in the LangSmith chart so migrations create `Replicated` table engines from the start. Clustered setups must be configured against a fresh schema; you cannot convert a standalone instance to clustered later.
- Spread replicas across availability zones.
- Schedule `BACKUP TABLE` or `BACKUP DATABASE` to object storage on a frequency matching your RPO. The community `clickhouse-backup` tool is also a popular option for scheduled, incremental backups with built-in S3, GCS, and Azure Blob support.
- For cross-region DR, copy the backup bucket to a secondary region. Cross-region ClickHouse replication is not generally supported in self-managed deployments and is not offered by ClickHouse Cloud either, so plan for a backup/restore failover model rather than a hot replica.
Blob storage
If you have enabled blob storage (recommended for production), your run inputs, outputs, errors, manifests, extras, events, and attachments live in S3, GCS, or Azure Blob Storage. Cloud blob services are durable by design, but you should still configure protection against accidental deletion and regional outages.
AWS
- Enable S3 Versioning to protect against accidental deletes and overwrites.
- Enable MFA Delete for high-security buckets.
- For cross-region DR, configure Cross-Region Replication (CRR) to a bucket in your DR region.
- Use S3 Object Lock for write-once-read-many (WORM) retention.
- Encrypt with SSE-KMS. LangSmith supports passing a specific KMS key ARN, see KMS encryption header support.
Redis
Redis stores ephemeral metadata, queue state, and cross-instance pub/sub. No durable data is stored in Redis, so you do not need to back it up. Communication with Redis is retried on retryable errors. The recovery design is to make Redis highly available within the active region and to reprovision it from scratch in the DR region.
- Use the managed service for your cloud: Amazon ElastiCache, Google Cloud Memorystore, or Azure Cache for Redis.
- Enable Multi-AZ failover.
- For cross-region DR, provision a fresh Redis instance in the DR region during failover; do not reuse an active region’s Redis URI in the new cluster.
Kubernetes configuration and secrets
The Helm chart values, Kubernetes `Secrets`, and identity bindings are as important as your data backups. A complete restore requires both.
- Helm values: Store `values.yaml` in source control. Track per-environment overrides separately.
- Image versions: Pin the LangSmith chart version and image tags so a recovery installs the same software version. See Self-host upgrades and Dependency versions.
- Secrets: LangSmith reads database, blob, and licensing credentials from Kubernetes `Secrets`. Mirror these to your DR cluster’s secret manager (AWS Secrets Manager, GCP Secret Manager, or Azure Key Vault). See Use an existing secret.
- TLS material: If you terminate TLS at the LangSmith ingress, back up the certificate and key, or reissue from your private CA in the DR region. See Custom TLS certificates.
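- IRSA / Workload Identity bindings: Recreate or update IAM roles and service-account bindings for the DR cluster. With IRSA, for example, a role's trust policy references a specific EKS cluster's OIDC provider, so bindings made for the primary cluster do not automatically apply to the DR cluster.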
- License key: Keep the LangSmith license key alongside other recovery secrets.
Reference deployment patterns
Single region with Multi-AZ HA (recommended baseline)
This is the minimum production posture and protects against zonal failures. It does not protect against a regional outage.
- Kubernetes node pools across at least two availability zones.
- Postgres in Multi-AZ HA mode (RDS Multi-AZ, Cloud SQL HA, or Flexible Server zone-redundant).
- ClickHouse as a managed service, or a 3-node replicated cluster spread across AZs.
- Redis with Multi-AZ failover enabled.
- Blob storage with versioning enabled and a redundancy tier of at least zone-redundant.
- Daily snapshots of every data store retained for at least 7 days.
Cross-region active/passive DR
This protects against a regional outage. It is significantly more expensive but is the right pattern for tier-1 deployments.
- A second Kubernetes cluster in the DR region with the LangSmith Helm chart installed but scaled to a low replica count (warm) or zero (cold).
- Postgres cross-region replica (RDS or Aurora cross-region replica, Cloud SQL cross-region replica, Azure Flexible Server cross-region replica). Promote on failover.
- ClickHouse Cloud or LangSmith Managed ClickHouse with a region failover plan, or ClickHouse backups copied to the DR region and restored into a fresh self-managed cluster on failover. Cross-region ClickHouse replication is not generally supported (ClickHouse Cloud does not offer it either), so plan for backup/restore rather than a hot DR replica.
- Blob storage replicated to a DR bucket with versioning and matching lifecycle rules.
- Redis provisioned fresh in the DR region during failover.
- DNS managed by Route 53, Cloud DNS, or Azure DNS with health checks and failover policies pointing at the LangSmith frontend ingress in each region.
LangSmith is a single-write platform. A cross-region deployment should be active/passive, not active/active. Writing to both regions concurrently against the same logical installation is not supported and will produce data inconsistency.
Recovery procedures
Restore after a zonal failure
In a single-region Multi-AZ deployment, zonal failures are handled automatically by your cloud provider:
- Managed Postgres fails over to its standby in another AZ. LangSmith pods reconnect via the cluster endpoint after retry.
- Managed Redis fails over similarly. LangSmith retries reconnect automatically.
- Kubernetes reschedules LangSmith pods on healthy nodes in remaining AZs. Verify that node pools and Horizontal Pod Autoscaler limits allow this headroom.
- Verify ingest by submitting a test trace from the SDK and confirming it appears in the UI.
Restore after a regional failure
This is the cross-region failover runbook. Adapt it to your specific infrastructure.
Declare failover
Confirm the primary region is unavailable. Communicate to stakeholders that you are failing over and what the expected RTO is.
Promote data stores
Promote the Postgres cross-region replica to primary in the DR region. For ClickHouse Cloud or LangSmith Managed ClickHouse, initiate the documented region failover. For self-managed ClickHouse, restore the latest backup into the DR cluster (this is typically the longest step).
Repoint blob storage
Update the LangSmith Helm `config.blobStorage.bucketName` and `apiURL` values to point at the DR bucket. Confirm the bucket has the same TTL lifecycle rules. See Blob storage configuration.
Provision Redis
Create a fresh managed Redis instance in the DR region. Update the LangSmith Helm `redis.external` values to point at it. Do not import dumps from the primary Redis; provision empty.
Scale the DR cluster
If running warm/cold, scale the LangSmith deployments to their production replica counts. Apply any pending Helm value updates from source control.
Run smoke tests
Submit a test trace, verify it lands in ClickHouse and (if blob storage is enabled) in the DR bucket. Open the UI and confirm traces, datasets, and projects load. Validate authentication. See Diagnostics.
Cut DNS over
Update DNS to route traffic to the DR ingress. Communicate the cutover to stakeholders.
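The ordering of the runbook matters: DNS should only be cut over once the smoke tests pass. A minimal sketch of that gating logic follows. The `run_runbook` function and the `FAILOVER_STEP_NAMES` list are hypothetical helpers, and each step's action is a placeholder callable that you would replace with your own automation or a manual confirmation prompt.

```python
from typing import Callable, List, Optional, Tuple

# Step names mirror the runbook above; the actions behind them are up to you.
FAILOVER_STEP_NAMES = [
    "Declare failover",
    "Promote data stores",
    "Repoint blob storage",
    "Provision Redis",
    "Scale the DR cluster",
    "Run smoke tests",
    "Cut DNS over",
]

Step = Tuple[str, Callable[[], bool]]


def run_runbook(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order and stop at the first one that reports failure.

    Returns the names of the completed steps and the name of the failed
    step, or None if every step succeeded.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            return completed, name
        completed.append(name)
    return completed, None


if __name__ == "__main__":
    # Placeholder actions that always succeed; replace with real checks.
    steps = [(name, lambda: True) for name in FAILOVER_STEP_NAMES]
    completed, failed = run_runbook(steps)
    print("completed:", completed)
    print("failed:", failed)
```

Because the runner stops at the first failure, a failed smoke test leaves traffic on the existing DNS records until someone investigates.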
Restore from snapshot
If you have lost the primary data store entirely and need to restore from snapshot:
Stop ingestion
Scale `langsmith-queue` and `langsmith-ingest-queue` to zero so no new traces are written while you restore.
Restore Postgres
Restore the Postgres backup to a new instance or perform PITR to the latest pre-incident timestamp. Update the LangSmith Helm `postgres.external` connection details to point to the restored instance.
Restore ClickHouse
Restore the most recent ClickHouse backup that aligns in time with the Postgres restore point. Restore time scales with data size.
Restore blob storage
If you lost blob data (rare), restore versioned objects from S3/GCS/Azure or copy from a replicated DR bucket.
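Choosing a ClickHouse backup that "aligns in time" with the Postgres restore point can be reduced to a simple selection rule. The sketch below assumes one reasonable interpretation: pick the latest ClickHouse backup taken at or before the Postgres restore point, so that restored traces do not postdate the restored operational data. The function name `pick_clickhouse_backup` and the timestamps are illustrative; confirm the rule that fits your own consistency requirements.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional


def pick_clickhouse_backup(
    backup_times: Iterable[datetime],
    postgres_restore_point: datetime,
) -> Optional[datetime]:
    """Return the latest backup time at or before the restore point, if any."""
    candidates = [t for t in backup_times if t <= postgres_restore_point]
    return max(candidates, default=None)


if __name__ == "__main__":
    # Hypothetical nightly backups and a PITR target on the second day.
    backups = [
        datetime(2024, 5, 1, 0, 0, tzinfo=timezone.utc),
        datetime(2024, 5, 2, 0, 0, tzinfo=timezone.utc),
        datetime(2024, 5, 3, 0, 0, tzinfo=timezone.utc),
    ]
    restore_point = datetime(2024, 5, 2, 13, 30, tzinfo=timezone.utc)
    print(pick_clickhouse_backup(backups, restore_point))
```

A `None` result means every available backup is newer than the restore point, which is a signal to revisit either the restore point or the backup retention window.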
Testing your DR plan
A backup is only as good as the last successful restore. Schedule the following exercises:
- Quarterly: Restore Postgres and ClickHouse snapshots into a non-production environment and run the diagnostics tooling and a smoke trace test. Measure actual restore time and confirm it is within RTO.
- Twice yearly: Perform a full cross-region failover drill against a staging installation. Promote the replica, repoint blob storage, scale the DR cluster, run smoke tests, and roll back.
- On every chart upgrade: Verify that the upgrade path does not invalidate your DR plan (for example, schema migrations applied only to the primary will need to replicate to the DR replica). See Self-host upgrades.
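To record whether a drill actually met your RTO, you can wrap the restore procedure in a timer. This is a minimal sketch under the assumption that the restore can be triggered from a single callable; `timed_drill` and the `restore` argument are hypothetical names, and the example restore is a stand-in that does nothing.

```python
import time
from datetime import timedelta
from typing import Callable, Tuple


def timed_drill(
    restore: Callable[[], None],
    rto_target: timedelta,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[timedelta, bool]:
    """Run a restore drill and report its duration and whether it met the RTO."""
    start = clock()
    restore()
    elapsed = timedelta(seconds=clock() - start)
    return elapsed, elapsed <= rto_target


if __name__ == "__main__":
    # Stand-in restore; replace with the script that drives your real drill.
    elapsed, within_rto = timed_drill(lambda: None, timedelta(hours=4))
    print(f"restore took {elapsed}, within RTO: {within_rto}")
```

The `clock` parameter exists so the function can be checked with a fake clock; in normal use the default monotonic clock is sufficient.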
Related pages
- Scalability and resilience
- Shared responsibility model
- Connect external Postgres
- Connect external ClickHouse
- Connect external Redis
- Enable blob storage
- Self-host upgrades
- Use an existing secret
- Diagnostics for self-hosted
- AWS self-hosted reference architecture
- GCP self-hosted reference architecture
- Azure self-hosted reference architecture

