Reference architecture
We recommend using GCP's managed services to provide a scalable, secure, and resilient platform. The following architecture applies to both self-hosted and hybrid deployments and aligns with the Google Cloud Well-Architected Framework (a brief connectivity sketch follows the list):
- Ingress & networking: Requests enter via Cloud Load Balancing within your VPC, secured using Cloud Armor and IAM-based authentication.
- Frontend & backend services: Containers run on Google Kubernetes Engine (GKE) behind the load balancer and route requests to other services within the cluster as necessary.
- Storage & databases:
  - Cloud SQL for PostgreSQL: metadata, projects, users, and short-term and long-term memory for deployed agents. LangSmith supports PostgreSQL version 14 or higher.
  - Memorystore for Redis: caching and job queues. Memorystore can be in single-instance or cluster mode, running Redis OSS version 5 or higher.
  - ClickHouse + Persistent Disks: analytics and trace storage.
    - We recommend using an externally managed ClickHouse solution unless security or compliance reasons prevent you from doing so.
    - ClickHouse is not required for hybrid deployments.
  - Cloud Storage: object storage for trace artifacts and telemetry.
- LLM integration: Optionally proxy requests to Vertex AI for LLM inference.
- Monitoring & observability: Integrate with Cloud Monitoring and Cloud Logging.
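For illustration, a minimal connectivity check against this topology might look like the following, assuming the langsmith Python SDK and a hypothetical internal endpoint exposed through Cloud Load Balancing (URL and API key are placeholders):

```python
from langsmith import Client

# Placeholder endpoint behind Cloud Load Balancing; replace with the URL and
# API key of your self-hosted or hybrid LangSmith instance.
client = Client(
    api_url="https://langsmith.internal.example.com/api",
    api_key="YOUR_API_KEY",
)

# Smoke test: list a tracing project to confirm the instance is reachable.
for project in client.list_projects():
    print(project.name)
    break  # one project is enough for a connectivity check
```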
Compute options
LangSmith supports multiple compute options depending on your requirements:

| Compute option | Description | Suitable for |
|---|---|---|
| Google Kubernetes Engine (preferred) | Advanced scaling and multi-tenant support | Large enterprises |
| Compute Engine-based | Full control, BYO-infra | Regulated or air-gapped environments |
Google Cloud Well-Architected best practices
This reference architecture is designed to align with the six pillars of the Google Cloud Well-Architected Framework:
Operational excellence
- Automate deployments with IaC (Terraform / Deployment Manager).
- Use Secret Manager for configuration and sensitive data (see the sketch after this list).
- Configure your LangSmith instance to export telemetry data and continuously monitor via Cloud Logging.
- The preferred method to manage LangSmith deployments is to create a CI process that builds Agent Server images and pushes them to Artifact Registry. Create a test deployment for pull requests before deploying a new revision to staging or production upon PR merge.
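As one example, application code running on GKE can read configuration from Secret Manager instead of baking credentials into images. A minimal sketch using the google-cloud-secret-manager client; the project and secret IDs are placeholders:

```python
from google.cloud import secretmanager

def read_secret(project_id: str, secret_id: str, version: str = "latest") -> str:
    """Fetch a secret payload from Secret Manager (e.g. a LangSmith API key)."""
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/{version}"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("utf-8")

# Placeholder names; substitute your own project and secret IDs.
api_key = read_secret("my-gcp-project", "langsmith-api-key")
```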
Security
- Use IAM roles with least-privilege policies and Workload Identity for secure pod-to-GCP-service authentication (see the sketch after this list).
- Enable encryption at rest (Cloud SQL, Cloud Storage, Persistent Disks) and in transit (TLS 1.2+).
- Integrate with Secret Manager for credentials.
- Use Identity Platform or Workload Identity Federation as an IDP in conjunction with LangSmith’s built-in authentication and authorization features to secure access to agents and their tools.
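With Workload Identity enabled, pods obtain GCP credentials through Application Default Credentials rather than exported key files. A minimal sketch; the bucket name is a placeholder:

```python
import google.auth
from google.cloud import storage

# Application Default Credentials resolve to the Google service account bound
# to the pod's Kubernetes service account when Workload Identity is configured;
# no key file is mounted into the pod.
credentials, project_id = google.auth.default()

# Any client built on ADC (here, Cloud Storage) authenticates the same way.
storage_client = storage.Client(credentials=credentials, project=project_id)
bucket = storage_client.bucket("example-langsmith-artifacts")  # placeholder bucket
print(f"Authenticated to project {project_id} via Workload Identity.")
```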
Reliability
- Replicate the LangSmith data plane across regions: deploy identical data planes to Kubernetes clusters in different regions, and deploy Cloud SQL and GKE services across multiple zones.
- Implement autoscaling for backend workers using the Horizontal Pod Autoscaler and Cluster Autoscaler (see the sketch after this list).
- Use Cloud DNS health checks and failover policies.
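For example, a Horizontal Pod Autoscaler for a backend worker Deployment could be created with the official Kubernetes Python client; the namespace and Deployment name below are hypothetical:

```python
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

# Scale a hypothetical "langsmith-backend" Deployment on CPU utilization.
hpa = client.V2HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="langsmith-backend-hpa", namespace="langsmith"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="langsmith-backend"
        ),
        min_replicas=2,
        max_replicas=10,
        metrics=[
            client.V2MetricSpec(
                type="Resource",
                resource=client.V2ResourceMetricSource(
                    name="cpu",
                    target=client.V2MetricTarget(type="Utilization", average_utilization=70),
                ),
            )
        ],
    ),
)

client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="langsmith", body=hpa
)
```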
Performance optimization
- Choose Compute Engine machine types that match your workload to optimize compute performance.
- Use Cloud Storage lifecycle policies for infrequently accessed trace data, moving to Nearline or Coldline storage classes.
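A lifecycle rule that moves aging trace artifacts to colder storage classes can be set with the google-cloud-storage client. A minimal sketch; the bucket name and age thresholds are placeholders:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("example-langsmith-traces")  # placeholder bucket name

# Move objects to Nearline after 30 days and Coldline after 90 days.
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.patch()  # persist the updated lifecycle configuration
```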
Cost optimization
- Right-size GKE clusters, and lower compute costs with Committed Use Discounts and Sustained Use Discounts.
- Monitor cost KPIs using Cloud Billing dashboards and Cost Management tools.
Sustainability
- Minimize idle workloads with on-demand compute and autoscaling.
- Store telemetry in lower-cost storage tiers using Cloud Storage lifecycle policies.
- Enable auto-shutdown for non-prod environments using scheduled actions.
Security and compliance
LangSmith can be configured for:
- Private Service Connect-only access (no public internet exposure, besides the egress necessary for billing).
- Cloud KMS-based encryption keys for Cloud Storage, Cloud SQL, and Persistent Disks (see the sketch after this list).
- Audit logging to Cloud Logging and Cloud Audit Logs.
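For example, a customer-managed encryption key from Cloud KMS can be set as the default for a Cloud Storage bucket. A minimal sketch; the project, key ring, key, and bucket names are placeholders:

```python
from google.cloud import storage

# Fully qualified Cloud KMS key name (all segments are placeholders).
kms_key_name = (
    "projects/my-gcp-project/locations/us-central1/"
    "keyRings/langsmith-keyring/cryptoKeys/langsmith-storage-key"
)

client = storage.Client()
bucket = client.get_bucket("example-langsmith-artifacts")  # placeholder bucket
bucket.default_kms_key_name = kms_key_name
bucket.patch()  # new objects are now encrypted with the CMEK by default
```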
Monitoring and evals
Use LangSmith to:
- Capture traces from LLM apps running on Vertex AI (see the sketch at the end of this section).
- Evaluate model outputs via LangSmith datasets.
- Track latency, token usage, and success rates.
- Visualize metrics in Cloud Monitoring dashboards.
- Export telemetry via OpenTelemetry and Prometheus exporters.
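For illustration, tracing a function call into a self-hosted LangSmith instance might look like the following, assuming a recent langsmith Python SDK that reads the LANGSMITH_* environment variables; the endpoint URL and model call are placeholders:

```python
import os
from langsmith import traceable

# Point the SDK at the self-hosted instance (placeholder URL and key).
os.environ["LANGSMITH_ENDPOINT"] = "https://langsmith.internal.example.com/api"
os.environ["LANGSMITH_API_KEY"] = "YOUR_API_KEY"
os.environ["LANGSMITH_TRACING"] = "true"

@traceable(name="summarize")
def summarize(text: str) -> str:
    # Call your Vertex AI-hosted model here; a stub stands in for illustration.
    return text[:80]

# The call below is captured as a trace with latency and metadata attached.
print(summarize("LangSmith records this invocation for monitoring and evaluation."))
```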