The Agent Server emits metrics through an OpenTelemetry (OTel) client. Metrics use the lg_api_ name prefix by default (override with METRIC_PREFIX). On self-hosted deployments, use this page to choose a scrape or push backend, enable the metric sets you need, and look up Prometheus names when building dashboards or alerts.

Metric backends

Agent Server splits metrics into two sets:
  • Deployment UI metrics: Surfaced in the LangSmith Deployment UI and exposed on the Agent Server Prometheus scrape endpoint (GET /metrics, format=prometheus) by default.
  • Internal metrics: Operational and debugging metrics used by LangChain operators. Sent to Datadog when configured. On Prometheus, internal metrics appear only when you opt in.
| Backend | Metric set | Enable |
| --- | --- | --- |
| Prometheus (scrape GET /metrics) | Deployment UI metrics by default. Set EXPOSE_INTERNAL_METRICS_PROMETHEUS=true to also expose internal metrics on the same endpoint. | Available when the OTel Prometheus exporter is installed. |
| Datadog (OTLP push) | Internal metrics only | Set LSD_DD_API_KEY (or CUSTOM_LSD_DD_API_KEY). Metrics push to https://{LSD_DD_ENDPOINT}/v1/metrics (default endpoint: otlp.us5.datadoghq.com). |
Prometheus and Datadog can run at the same time. Datadog receives only the internal metric set, so Deployment UI metrics are not duplicated across the two backends.
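The routing rules above can be sketched as a small function. This is an illustration of the documented behavior, not the Agent Server's implementation: the function name, the set labels ("deployment_ui", "internal"), and the case-insensitive parsing of "true" are assumptions.

```python
def resolve_backends(env, prometheus_exporter_installed=True):
    """Return a mapping of active backend -> metric sets it receives.

    Hypothetical helper: mirrors the backend table, nothing more.
    """
    backends = {}
    if prometheus_exporter_installed:
        # Deployment UI metrics are exposed on GET /metrics by default.
        sets = {"deployment_ui"}
        # Assumption: the flag is compared case-insensitively against "true".
        if env.get("EXPOSE_INTERNAL_METRICS_PROMETHEUS", "").lower() == "true":
            sets.add("internal")
        backends["prometheus"] = sets
    # Datadog turns on when an API key is present and receives internal metrics only.
    if env.get("LSD_DD_API_KEY") or env.get("CUSTOM_LSD_DD_API_KEY"):
        backends["datadog"] = {"internal"}
    return backends
```

For example, calling `resolve_backends({"LSD_DD_API_KEY": "..."})` yields Deployment UI metrics on Prometheus and internal metrics on Datadog, with no overlap between the two.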

Metric tiers

Each metric is assigned a tier. For internal metrics, the tier determines whether the metric is recorded:
| Tier | Value | Purpose |
| --- | --- | --- |
| CRITICAL | 1 | Core health and failure signals. Always recorded when internal metrics are enabled, including on dev / dev_free deployments. |
| INFO | 2 | Operational detail for production monitoring. Default ceiling in production (METRIC_MAX_EMITTING_TIER=2). |
| DEBUG | 3 | Deeper diagnostics for troubleshooting. Omitted unless you raise METRIC_MAX_EMITTING_TIER. |
| DEEP_DEBUG | 4 | Verbose diagnostics. Omitted unless you raise METRIC_MAX_EMITTING_TIER. |
Set METRIC_MAX_EMITTING_TIER to the highest tier you want recorded for internal metrics. Deployment UI metrics ignore this setting and always emit.
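As a minimal sketch of the tier rule, the check below records an internal metric only when its tier value is at or below the ceiling, and always records Deployment UI metrics. The `Tier` enum and `is_recorded` function are hypothetical names used for illustration; the default of 2 reflects the documented production ceiling.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Tier values as listed in the table above."""
    CRITICAL = 1
    INFO = 2
    DEBUG = 3
    DEEP_DEBUG = 4


def is_recorded(tier, max_emitting_tier=2, deployment_ui_metric=False):
    """Hypothetical check: would a metric at `tier` be recorded?"""
    if deployment_ui_metric:
        # Deployment UI metrics ignore METRIC_MAX_EMITTING_TIER and always emit.
        return True
    # Internal metrics are recorded at or below the configured ceiling.
    return int(tier) <= max_emitting_tier
```

With the production default, `is_recorded(Tier.DEBUG)` is false; raising the ceiling to 3 makes it true while DEEP_DEBUG metrics stay omitted.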

Configure export

Prometheus

To scrape Deployment UI metrics:
  1. Point your Prometheus collector at the Agent Server /metrics endpoint (for example, https://<agent-server-host>/metrics).
  2. Use the default format=prometheus query parameter (or omit it).
To also expose internal metrics on the same endpoint, set:
EXPOSE_INTERNAL_METRICS_PROMETHEUS=true
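To check what a scrape returns before wiring up a collector, a short script can build the scrape URL and list the metric names in a response body. This is a rough sketch: `build_scrape_url` and `metric_names` are hypothetical helpers, and the parser handles only plain sample lines of the Prometheus text format (comment lines are skipped).

```python
from urllib.parse import urlencode


def build_scrape_url(base_url, fmt="prometheus"):
    """Build the /metrics URL with the format query parameter."""
    return f"{base_url.rstrip('/')}/metrics?{urlencode({'format': fmt})}"


def metric_names(exposition_text, prefix="lg_api_"):
    """Collect metric names that start with `prefix` from a scrape body."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line is "name{labels} value" or "name value".
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            names.add(name)
    return names
```

The response body can be fetched with any HTTP client (for example `urllib.request.urlopen(build_scrape_url(host)).read().decode()`) and passed to `metric_names`. Comparing the result with and without EXPOSE_INTERNAL_METRICS_PROMETHEUS=true shows which internal metrics the flag adds.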

Datadog

To push internal metrics to Datadog instead of (or alongside) Prometheus:
  1. Set LSD_DD_API_KEY to your Datadog API key. DATADOG_METRICS_ENABLED turns on automatically when the key is present.
  2. Optionally set LSD_DD_ENDPOINT (default: otlp.us5.datadoghq.com). The legacy aliases CUSTOM_LSD_DD_API_KEY and CUSTOM_LSD_DD_ENDPOINT are also accepted.
Datadog receives only internal metrics. Continue scraping /metrics for Deployment UI metrics in Prometheus or Grafana.
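The Datadog settings can be summarized in a sketch that resolves the push URL from the environment. The function name and return shape are hypothetical, and the precedence shown (primary variable first, then the legacy alias) is an assumption; the source does not state which wins when both are set.

```python
DEFAULT_DD_ENDPOINT = "otlp.us5.datadoghq.com"


def datadog_export_config(env):
    """Return the OTLP push settings, or None when Datadog export is off.

    Hypothetical helper; assumes primary variables take precedence
    over the legacy CUSTOM_* aliases.
    """
    api_key = env.get("LSD_DD_API_KEY") or env.get("CUSTOM_LSD_DD_API_KEY")
    if not api_key:
        # No key present, so Datadog export stays disabled.
        return None
    endpoint = (
        env.get("LSD_DD_ENDPOINT")
        or env.get("CUSTOM_LSD_DD_ENDPOINT")
        or DEFAULT_DD_ENDPOINT
    )
    # Metrics push to https://{LSD_DD_ENDPOINT}/v1/metrics.
    return {"url": f"https://{endpoint}/v1/metrics", "api_key": api_key}
```

With only LSD_DD_API_KEY set, the resolved URL is https://otlp.us5.datadoghq.com/v1/metrics.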

Deployment UI metrics

These metrics have lsd_web_metric=true. They appear on the Prometheus /metrics scrape by default and power the LangSmith Deployment UI. Tier values are listed for reference; these metrics always emit regardless of METRIC_MAX_EMITTING_TIER.
| Name | Type | Tier | Description |
| --- | --- | --- | --- |
| lg_api_http_requests_total | Counter | INFO | Total HTTP requests to the Agent Server. |
| lg_api_http_requests_latency | Histogram (milliseconds) | INFO | HTTP request latency. |
| lg_api_run_queue_wait_time_1st_attempt | Histogram (milliseconds) | INFO | Time jobs spend waiting in the queue before first processing. |
| lg_api_num_pending_runs | Gauge | INFO | Runs currently pending. On Postgres backends, the Go core is the source; on in-memory backends, the Python collector emits this gauge. |
| lg_api_num_running_runs | Gauge | INFO | Runs currently running. Same runtime split as lg_api_num_pending_runs. |
| lg_api_workers_max | Gauge | CRITICAL | Maximum worker capacity. Emitted by the Python collector on in-memory runtimes; the Go core emits this on Postgres. |
| lg_api_workers_active | Gauge | CRITICAL | Workers currently executing runs. |
| lg_api_workers_available | Gauge | CRITICAL | Workers available to accept new runs. |
| lg_api_pg_pool_max | Gauge | CRITICAL | Maximum Postgres connection pool size. |
| lg_api_pg_pool_size | Gauge | CRITICAL | Connections currently managed by the Postgres pool (idle, in use, or being prepared). |
| lg_api_pg_pool_available | Gauge | INFO | Idle connections in the Postgres pool. |
| lg_api_pg_pool_requests_queued_total | Counter | CRITICAL | Postgres connection requests queued because a connection was not immediately available. The OTel Prometheus exporter appends _total to counter names. |
| lg_api_pg_pool_requests_errors_total | Counter | CRITICAL | Postgres connection request errors (timeouts, queue full, and similar failures). |
| lg_api_redis_pool_max | Gauge | INFO | Maximum Redis connection pool size. |
| lg_api_redis_pool_size | Gauge | INFO | Redis connections currently in use. |
| lg_api_redis_pool_available | Gauge | INFO | Idle connections in the Redis pool. |
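When building a dashboard panel from these gauges, a saturation ratio is often more useful than the raw values. The sketch below derives worker utilization from scraped samples; `gauge_ratio` and `worker_utilization` are hypothetical helpers, and the `prefix` argument assumes that a METRIC_PREFIX override includes the trailing underscore, as the default lg_api_ does.

```python
def gauge_ratio(samples, numerator, denominator, prefix="lg_api_"):
    """Divide one gauge by another; return None when the denominator is absent or zero."""
    denom = samples.get(prefix + denominator, 0.0)
    if denom <= 0:
        return None
    return samples.get(prefix + numerator, 0.0) / denom


def worker_utilization(samples, prefix="lg_api_"):
    """Fraction of worker capacity currently executing runs."""
    return gauge_ratio(samples, "workers_active", "workers_max", prefix)
```

Here `samples` is a plain mapping from metric name to its latest value, for example `{"lg_api_workers_active": 5, "lg_api_workers_max": 10}`. The same `gauge_ratio` helper applies to the pool gauges, such as lg_api_pg_pool_available over lg_api_pg_pool_max for idle headroom.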

Internal metrics

These metrics have lsd_web_metric=false. By default they are exported to Datadog when LSD_DD_API_KEY is set. Set EXPOSE_INTERNAL_METRICS_PROMETHEUS=true to include them on the Prometheus /metrics scrape. Internal metrics at or below METRIC_MAX_EMITTING_TIER are recorded; higher-tier metrics are omitted.

Run lifecycle

| Name | Type | Tier | Description |
| --- | --- | --- | --- |
| lg_api_run_attempt_started_counter | Counter | CRITICAL | Run execution attempts started. |
| lg_api_run_success_counter | Counter | CRITICAL | Runs completed successfully. |
| lg_api_run_canceled_by_request_counter | Counter | CRITICAL | Runs canceled by an explicit cancel request. |
| lg_api_run_failed_retriable_counter | Counter | CRITICAL | Runs failed with a retriable error. |
| lg_api_run_failed_after_retry_counter | Counter | CRITICAL | Runs that failed after exhausting retries. |
| lg_api_run_exceed_max_attempts_at_start_counter | Counter | CRITICAL | Runs rejected at start because max attempts were already exceeded. |
| lg_api_run_abandoned_by_shutdown_counter | Counter | CRITICAL | Runs abandoned during server shutdown. |
| lg_api_run_set_status_error_counter | Counter | CRITICAL | Errors while updating run status. |
| lg_api_failed_to_fetch_runs_counter | Counter | CRITICAL | Failures fetching runs from the queue. |
| lg_api_run_execution_latency | Histogram (milliseconds) | INFO | End-to-end run execution latency. |
| lg_api_run_queue_wait_time_retry_attempt | Histogram (milliseconds) | INFO | Queue wait time on retry attempts (after the first). |
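One way to turn the run lifecycle counters into an alert signal is a failure ratio over a time window. The sketch below is illustrative only: the choice of which counters go into the ratio is an assumption (it is not an official health definition), and `deltas` is a hypothetical mapping from counter name to its increase over the window. Per the note on counters earlier on this page, names on the Prometheus endpoint may carry an extra _total suffix; adjust the keys to match what your backend reports.

```python
def terminal_failure_ratio(deltas, prefix="lg_api_"):
    """Share of finished runs that failed after exhausting retries.

    Illustrative ratio: failed_after_retry / (success + failed_after_retry).
    Returns None when no runs finished in the window.
    """
    succeeded = deltas.get(prefix + "run_success_counter", 0)
    failed = deltas.get(prefix + "run_failed_after_retry_counter", 0)
    finished = succeeded + failed
    if finished == 0:
        return None
    return failed / finished
```

Canceled and abandoned runs are left out of the denominator here; include lg_api_run_canceled_by_request_counter or lg_api_run_abandoned_by_shutdown_counter if your alerting policy counts them as finished runs.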

Streaming and protocol v2

| Name | Type | Tier | Description |
| --- | --- | --- | --- |
| lg_api_streaming_data_loss_counter | Counter | CRITICAL | Streaming data loss events. |
| lg_api_stream_publish_latency | Histogram (milliseconds) | INFO | Latency publishing stream chunks. |
| lg_api_stream_data_size_bytes | Histogram | DEBUG | Size of published stream payloads in bytes. |
| lg_api_protocol_v2_buffer_evicted_counter | Counter | INFO | Event Streaming v2 replay buffer evictions. |
| lg_api_protocol_v2_event_emitted_counter | Counter | DEBUG | Event Streaming v2 events emitted. |
| lg_api_protocol_v2_resume_gap_counter | Counter | INFO | Event Streaming v2 resume gaps detected during replay. |
| lg_api_protocol_v2_transport_send_failure_counter | Counter | INFO | Event Streaming v2 transport send failures. |
| lg_api_protocol_v2_buffer_size | Gauge | DEBUG | Current Event Streaming v2 replay buffer occupancy per run. Tune LSD_PROTOCOL_V2_BUFFER_SIZE when this approaches the limit. |
| lg_api_protocol_v2_replayed_events | Histogram | DEBUG | Number of events replayed on Event Streaming v2 reconnect. |
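The tuning hint for lg_api_protocol_v2_buffer_size can be expressed as a simple threshold check. This sketch assumes that the limit equals the configured LSD_PROTOCOL_V2_BUFFER_SIZE value and that occupancy is measured in the same unit; the function name and the 0.8 threshold are hypothetical choices, not documented defaults.

```python
def buffer_near_limit(occupancy, configured_buffer_size, threshold=0.8):
    """Return True when replay buffer occupancy reaches `threshold` of the limit."""
    if configured_buffer_size <= 0:
        raise ValueError("configured_buffer_size must be positive")
    return occupancy / configured_buffer_size >= threshold
```

Because lg_api_protocol_v2_buffer_size is a DEBUG-tier metric, it is only recorded when METRIC_MAX_EMITTING_TIER is raised to 3 or higher.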

Server and infrastructure

| Name | Type | Tier | Description |
| --- | --- | --- | --- |
| lg_api_server_started_counter | Counter | INFO | Server start events. |
| lg_api_server_requested_to_stop_counter | Counter | INFO | Graceful shutdown requests received. |
| lg_api_server_stopped_counter | Counter | INFO | Server stop events. |
| lg_api_graph_recursion_limit_error_counter | Counter | INFO | Graph recursion limit errors. |
| lg_api_publish_queue_availability | Gauge | CRITICAL | Redis publish queue availability signal. |
