- Retries — automatically re-run failed attempts based on exception type and backoff settings
- Timeouts — cap how long a single attempt may run
- Error handling — run a recovery function after all retries are exhausted
Use setNodeDefaults to configure these mechanisms once for all nodes instead of repeating them on every addNode call.
These compose in a fixed order: when a node attempt raises any exception (including NodeTimeoutError from a timeout), the retry policy decides whether to retry. Only after retries are exhausted does the error handler run.
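The ordering can be modelled in a few lines. The following is a self-contained simulation, not LangGraph's implementation: runNode, Attempt, and RetryPolicyLike are hypothetical names chosen for illustration, and the loop is synchronous with no backoff delay.

```typescript
// Hypothetical, simplified model of the retry -> error-handler ordering.
// It does not import LangGraph; names and shapes are illustrative only.
type Attempt<T> = (attemptNumber: number) => T;

interface RetryPolicyLike {
  maxAttempts: number; // includes the first attempt
  retryOn: (error: unknown) => boolean;
}

function runNode<T>(
  attempt: Attempt<T>,
  policy: RetryPolicyLike | undefined,
  errorHandler?: (error: unknown) => T,
): T {
  // Without a policy, the first failure ends the attempt loop.
  const maxAttempts = policy?.maxAttempts ?? 1;
  let lastError: unknown;
  for (let n = 1; n <= maxAttempts; n++) {
    try {
      return attempt(n);
    } catch (error) {
      lastError = error;
      // The policy sees every thrown error, including timeouts.
      const retryable = policy !== undefined && policy.retryOn(error);
      if (!retryable) break;
    }
  }
  // Only once retries are exhausted (or refused) does the handler run.
  if (errorHandler) return errorHandler(lastError);
  throw lastError;
}

// Usage: fails twice, succeeds on the third attempt, handler never runs.
const calls: number[] = [];
const result = runNode(
  (n) => {
    calls.push(n);
    if (n < 3) throw new Error("transient");
    return "ok";
  },
  { maxAttempts: 3, retryOn: () => true },
  () => "recovered",
);
```

If every attempt fails, the same call returns the handler's value instead of throwing.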
For stopping a run cleanly at a superstep boundary and resuming later, see Graceful shutdown.
Per-node timeouts and node-level error handlers require @langchain/langgraph>=1.4.0.
Retries
A retry policy automatically re-runs a failed node attempt based on exception type and backoff settings. Pass retryPolicy to addNode.
Default behavior
Retries are opt-in. A node retries only when it has a retryPolicy configured, either directly or through graph defaults with setNodeDefaults. An empty policy ({}) is enough. Without a policy, the first failure ends the attempt and LangGraph does not call retryOn.
If the policy omits retryOn, LangGraph uses a built-in handler that retries thrown errors except:
- Abort and cancellation errors: error.name === "AbortError", or error.message starts with "Cancel" or "AbortError"
- GraphValueError, matched by error.name
- Aborted connections: error.code === "ECONNABORTED"
- HTTP client errors with status 400, 401, 402, 403, 404, 405, 406, 407, or 409, read from error.response?.status or error.status for clients such as fetch, Axios, and similar clients
- OpenAI-style quota errors: error.error?.code === "insufficient_quota"
To change which errors are retried, pass a custom retryOn. NodeTimeoutError is not on this blocklist, so it is retryable when a retry policy is configured.
Some failures bypass retryOn. Graph control-flow errors, such as GraphInterrupt and Command routing, bubble up without retrying. An aborted run signal also stops the retry loop, even if retryOn would return true.
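To make the blocklist concrete, here is an illustrative predicate that follows the rules listed above. It is a re-implementation for explanation only, not LangGraph's source. The names blocklistRetryOn and asRecord are hypothetical, and treating non-object throws as retryable is an assumption.

```typescript
// Illustrative re-implementation of the documented blocklist.
// This is NOT LangGraph's source; the field probing is an assumption.
const NON_RETRYABLE_STATUS = new Set([400, 401, 402, 403, 404, 405, 406, 407, 409]);

function asRecord(value: unknown): Record<string, unknown> | undefined {
  return typeof value === "object" && value !== null
    ? (value as Record<string, unknown>)
    : undefined;
}

function blocklistRetryOn(error: unknown): boolean {
  const e = asRecord(error);
  if (!e) return true; // assumption: nothing to inspect, treat as retryable

  const name = typeof e.name === "string" ? e.name : "";
  const message = typeof e.message === "string" ? e.message : "";

  // Abort and cancellation errors.
  if (name === "AbortError") return false;
  if (message.startsWith("Cancel") || message.startsWith("AbortError")) return false;
  // GraphValueError is matched by name.
  if (name === "GraphValueError") return false;
  // Aborted connections.
  if (e.code === "ECONNABORTED") return false;
  // HTTP client errors, read from error.response?.status or error.status.
  const status = asRecord(e.response)?.status ?? e.status;
  if (typeof status === "number" && NON_RETRYABLE_STATUS.has(status)) return false;
  // OpenAI-style quota errors.
  if (asRecord(e.error)?.code === "insufficient_quota") return false;

  return true;
}
```

Status 408 is absent from the list, so under these rules a request timeout is retried, as is any 5xx response.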
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| maxAttempts | number | 3 | Maximum number of attempts, including the first. |
| initialInterval | number | 500 | Milliseconds before the first retry. |
| backoffFactor | number | 2.0 | Multiplier applied to the interval after each retry. |
| maxInterval | number | 128000 | Maximum milliseconds between retries. |
| jitter | boolean | true | Add random jitter to the interval. |
| retryOn | (error: unknown) => boolean | built-in handler (when policy is set) | Callable returning true for retryable exceptions. Only used when retryPolicy is configured. |
| logWarning | boolean | true | Whether to log a warning when a retry is attempted. |
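As a worked example of how these parameters interact, the sketch below computes the wait before each retry with jitter ignored. The formula is an assumption inferred from the parameter descriptions, and retryDelays and BackoffSettings are hypothetical names.

```typescript
// Sketch of the delay schedule implied by the table, ignoring jitter.
// The exact formula LangGraph uses is assumed, not confirmed by the text.
interface BackoffSettings {
  maxAttempts: number;
  initialInterval: number; // ms
  backoffFactor: number;
  maxInterval: number; // ms
}

const DEFAULTS: BackoffSettings = {
  maxAttempts: 3,
  initialInterval: 500,
  backoffFactor: 2.0,
  maxInterval: 128000,
};

function retryDelays(overrides: Partial<BackoffSettings> = {}): number[] {
  const p: BackoffSettings = { ...DEFAULTS, ...overrides };
  const delays: number[] = [];
  let interval = p.initialInterval;
  // maxAttempts counts the first attempt, so there are maxAttempts - 1 retries.
  for (let retry = 1; retry < p.maxAttempts; retry++) {
    delays.push(Math.min(interval, p.maxInterval));
    interval *= p.backoffFactor;
  }
  return delays;
}

const defaultDelays = retryDelays(); // [500, 1000]
const cappedDelays = retryDelays({ maxAttempts: 5, maxInterval: 1500 }); // [500, 1000, 1500, 1500]
```

With the defaults, a node that keeps failing waits roughly 500 ms and then 1000 ms before giving up after the third attempt.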
Custom retry logic
Pass a callable to retryOn. Unlike Python, there is no exported defaultRetryOn helper, so implement your own predicate.
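A custom predicate might look like the following. This is one possible policy, not a recommended default: retryOnTransient and the specific status and error codes are illustrative choices.

```typescript
// Hypothetical custom predicate: retry only rate limits, server errors,
// and a couple of transient network error codes. Adjust to your clients.
const TRANSIENT_CODES = new Set(["ECONNRESET", "ETIMEDOUT"]);

function retryOnTransient(error: unknown): boolean {
  if (typeof error !== "object" || error === null) return false;
  const e = error as { status?: unknown; code?: unknown };
  if (typeof e.status === "number") {
    return e.status === 429 || e.status >= 500;
  }
  return typeof e.code === "string" && TRANSIENT_CODES.has(e.code);
}

// The predicate travels inside the policy object passed as retryPolicy.
const retryPolicy = { maxAttempts: 4, retryOn: retryOnTransient };
```

Because this predicate is an allowlist, any error it does not recognise fails fast rather than consuming retry attempts.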
Inspect retry state
Use execution info inside a node to inspect the current attempt number. This is useful for switching to a fallback when the primary call keeps failing.
executionInfo exposes the following fields:
| Attribute | Type | Description |
|---|---|---|
| nodeAttempt | number | Current attempt number (1-indexed). 1 on the first try, 2 on the first retry, etc. |
| nodeFirstAttemptTime | number \| undefined | Unix timestamp (ms) of when the first attempt started. Constant across retries. |
| threadId | string \| undefined | Thread ID for the current execution. undefined without a checkpointer. |
| runId | string \| undefined | Run ID for the current execution. undefined when not provided in config. |
| checkpointId | string | Checkpoint ID for the current execution. |
| checkpointNs | string | Checkpoint namespace for the current execution. |
| taskId | string | Task ID for the current execution. |
executionInfo is available even without a retry policy—nodeAttempt defaults to 1.
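A fallback decision based on these fields can be kept as a small pure function. In this sketch, ExecutionInfoLike mirrors two fields from the table, while chooseModel, elapsedSinceFirstAttempt, and the "primary"/"fallback" labels are hypothetical; how the info object reaches your node is not shown here.

```typescript
// Minimal sketch: choose a fallback once the primary has failed twice.
// ExecutionInfoLike copies two documented fields; everything else is made up.
interface ExecutionInfoLike {
  nodeAttempt: number; // 1 on the first try
  nodeFirstAttemptTime?: number; // Unix ms
}

function chooseModel(info: ExecutionInfoLike): "primary" | "fallback" {
  // Attempts 1 and 2 use the primary; attempt 3 onward switches over.
  return info.nodeAttempt >= 3 ? "fallback" : "primary";
}

function elapsedSinceFirstAttempt(info: ExecutionInfoLike, now: number): number {
  // The first-attempt time is constant across retries, so this measures
  // total time spent on the node, including earlier failed attempts.
  return info.nodeFirstAttemptTime === undefined ? 0 : now - info.nodeFirstAttemptTime;
}
```

Keeping the decision outside the node body makes the attempt threshold easy to unit test without running a graph.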
Timeouts
Requires @langchain/langgraph>=1.4.0.
The timeout parameter on addNode caps how long a single node attempt may run. Pass a number (milliseconds) or a TimeoutPolicy for separate run and idle limits.
Run timeout
runTimeout is a hard wall-clock cap on a single attempt. It is never refreshed, regardless of node activity.
When the limit is reached, LangGraph raises NodeTimeoutError, clears any writes from the failed attempt, and lets the retry policy decide whether to retry.
Idle timeout
idleTimeout is a progress-resetting cap. It fires only when the node stops making observable progress for the specified duration. Unlike runTimeout, the clock resets whenever the node produces a progress signal.
You can set runTimeout and idleTimeout together. Whichever fires first cancels the attempt.
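The difference between the two clocks is easiest to see on a timeline. The toy model below takes the times of progress signals and reports which limit, if any, would fire first. It illustrates the semantics described above and is not how LangGraph schedules its timers; firstTimeout and TimeoutPolicyLike are hypothetical names.

```typescript
// Toy timeline model of run vs idle limits. Times are ms since the attempt
// started. This is an illustration, not LangGraph's timer implementation.
interface TimeoutPolicyLike {
  runTimeout?: number;
  idleTimeout?: number;
}

type Fired = { kind: "run" | "idle"; elapsed: number } | null;

function firstTimeout(
  policy: TimeoutPolicyLike,
  progressAt: number[], // sorted times of progress signals
  finishedAt: number, // when the attempt would complete on its own
): Fired {
  const runDeadline = policy.runTimeout ?? Infinity;
  let idleDeadline = policy.idleTimeout ?? Infinity;

  for (const t of [...progressAt, finishedAt]) {
    // A deadline that passes before the next event fires first.
    const next = Math.min(runDeadline, idleDeadline);
    if (next < t) {
      return { kind: idleDeadline < runDeadline ? "idle" : "run", elapsed: next };
    }
    // Progress refreshes only the idle clock; the run clock never resets.
    if (policy.idleTimeout !== undefined) idleDeadline = t + policy.idleTimeout;
  }
  return null; // the attempt finished before any limit was hit
}
```

With runTimeout 1000 and idleTimeout 300, a node that signals every 200 ms is still cancelled by the run limit at 1000 ms, while a node that goes silent after 200 ms is cancelled by the idle limit at 500 ms.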
Progress signals
Under the default refreshOn: "auto", the idle clock resets on any of the following:
- State writes through the graph write path
- Custom stream output via runtime.writer
- Child-task scheduling
- Any LangChain callback event from the node or its descendants (LLM tokens, tool calls, chain start/end, etc.)
Heartbeat mode
Set refreshOn: "heartbeat" to narrow the refresh source to explicit runtime.heartbeat() calls only. This is useful when you want a strict idle definition that isn’t reset by chatty descendants.
Manual heartbeats
For long-running work that doesn’t naturally emit progress signals, call runtime.heartbeat() to manually reset the idle clock.
runtime.heartbeat() is a no-op outside an idle-timed attempt, so you can call it unconditionally.
NodeTimeoutError
When a timeout fires, LangGraph raises NodeTimeoutError with structured context about which limit was hit:
| Attribute | Type | Description |
|---|---|---|
| node | string | Name of the node whose execution timed out. |
| elapsed | number | Milliseconds elapsed before the timeout fired. |
| kind | "idle" \| "run" | Which timeout fired. |
| timeout | number | The value (ms) of the timeout that fired. |
| idleTimeout | number \| undefined | The configured idle timeout (milliseconds), if any. |
| runTimeout | number \| undefined | The configured run timeout (milliseconds), if any. |
Use isNodeTimeoutError(error) to narrow caught errors in TypeScript.
NodeTimeoutError is retryable by default. Combining timeout with a retry policy works out of the box: the timeout clock resets on each new attempt, and writes from a timed-out attempt are cleared before the next retry.
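The structured fields make it straightforward to log or branch on the kind of timeout. The sketch below uses a local stand-in for the error shape so that it runs without the library. In real code, prefer the library's isNodeTimeoutError. Matching on name === "NodeTimeoutError" is an assumption, and looksLikeNodeTimeout and describeFailure are hypothetical helpers.

```typescript
// Stand-in for the documented error shape. The structural guard is only
// illustrative; the name check in particular is an assumption.
interface NodeTimeoutLike {
  name: string;
  node: string;
  elapsed: number;
  kind: "idle" | "run";
  timeout: number;
}

function looksLikeNodeTimeout(error: unknown): error is NodeTimeoutLike {
  if (typeof error !== "object" || error === null) return false;
  const e = error as Partial<NodeTimeoutLike>;
  return (
    e.name === "NodeTimeoutError" &&
    typeof e.node === "string" &&
    typeof e.elapsed === "number" &&
    typeof e.timeout === "number" &&
    (e.kind === "idle" || e.kind === "run")
  );
}

function describeFailure(error: unknown): string {
  if (looksLikeNodeTimeout(error)) {
    // Narrowed: the timeout-specific fields are now typed.
    return `${error.node} hit its ${error.kind} timeout (${error.timeout} ms) after ${error.elapsed} ms`;
  }
  return error instanceof Error ? error.message : String(error);
}
```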
Dynamic timeouts with Send
When using Send to dispatch nodes dynamically (for example, in map-reduce patterns), you can pass a timeout directly on the Send to override the target node’s static timeout for that specific push.
Without a timeout on the Send, the target node’s timeout (set at addNode time) applies. This lets you set a default timeout on the node and tighten it for individual calls.
Error handling
Requires @langchain/langgraph>=1.4.0.
An error handler runs after a node’s retries are exhausted and can update state or route elsewhere by returning a Command. This is useful for compensation flows (Saga patterns) where you want to recover gracefully rather than abort the entire graph.
Pass errorHandler to addNode on StateGraph only (not the base Graph class).
NodeError
Error handlers receive failure context through a typed error: NodeError parameter.
NodeError is a class with two fields:
| Attribute | Type | Description |
|---|---|---|
| node | string | Name of the node whose execution failed. |
| error | Error | The exception thrown by the failed node. |
The error: NodeError parameter is opt-in. Handlers that don’t need failure context can omit the second argument and accept only state.
Route with Command
Error handlers can return a Command to update state and route to a specific node, enabling Saga / compensation patterns.
For example, suppose a chargePayment node retries on ConnectionError up to 3 times. If retries are exhausted (or the error isn’t a ConnectionError), the handler compensates by updating state and routing to finalize instead of aborting the graph.
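The handler side of that flow can be sketched as a pure function. Here CommandLike stands in for a Command carrying a state update and a routing target, and NodeErrorLike copies the two documented NodeError fields. OrderState, onChargeFailed, and the status values are hypothetical.

```typescript
// Hypothetical compensation handler. CommandLike is a local stand-in for a
// Command with a state update and a routing target, not the library class.
interface OrderState {
  orderId: string;
  status: "pending" | "paid" | "payment_failed";
  failureReason?: string;
}

interface NodeErrorLike {
  node: string;
  error: Error;
}

interface CommandLike<S> {
  update: Partial<S>;
  goto: string;
}

function onChargeFailed(state: OrderState, failure: NodeErrorLike): CommandLike<OrderState> {
  // Record why the charge failed, then route to cleanup instead of aborting.
  return {
    update: {
      status: "payment_failed",
      failureReason: `${failure.node} failed for order ${state.orderId}: ${failure.error.message}`,
    },
    goto: "finalize",
  };
}
```

Returning the update and the route together keeps compensation atomic from the graph's point of view: the finalize node sees the failure reason in state when it runs.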
Resume-safe failures
Failure provenance is checkpointed. If the graph is interrupted or the process crashes after a node fails but before the handler completes, the handler sees the same NodeError context when the graph resumes from its checkpoint.
Behavior with interrupt()
Subgraph failures
If a node wraps a subgraph and the subgraph raises an unhandled exception, that exception surfaces to the parent node. If the parent node has an error handler, the handler fires with the subgraph’s exception in error.error.
Graph defaults
Requires @langchain/langgraph>=1.4.0.
Instead of repeating retryPolicy, errorHandler, timeout, or cachePolicy on every addNode call, use setNodeDefaults to configure graph-wide defaults in one place.
For example, with defaults set once, two nodes stepA and stepB share the same retry policy, error handler, timeout, and cache policy without any duplication.
Precedence
Per-node values passed directly to addNode() always override defaults set by setNodeDefaults(). Defaults are resolved at compile() time, so you can call setNodeDefaults() before or after addNode() in any order.
Applicability matrix
Not all defaults apply to all node types. Error-handler nodes (those registered via addNode(..., { errorHandler })) are excluded from certain defaults to prevent unsafe behavior:
| setNodeDefaults parameter | Applies to regular nodes | Applies to error-handler nodes | Reason |
|---|---|---|---|
| retryPolicy | ✅ | ✅ | Handlers should be retried on transient failures |
| timeout | ✅ | ✅ | Stuck handlers should be cancelled like stuck regular nodes |
| errorHandler | ✅ | ❌ | Handlers must never catch themselves |
| cachePolicy | ✅ | ❌ | Caching handler results is unsafe |
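The precedence rule and the matrix combine into a small resolution step. The sketch below models it with opaque option values. resolveNodeOptions is a hypothetical helper, not a LangGraph export, and representing the handler as a string name is a simplification.

```typescript
// Sketch of default resolution: per-node values win, and handler nodes skip
// the errorHandler and cachePolicy defaults. Option values are opaque here.
interface NodeOptionsLike {
  retryPolicy?: { maxAttempts?: number };
  timeout?: number;
  errorHandler?: string; // handler name, kept as a string for simplicity
  cachePolicy?: { ttl?: number };
}

function resolveNodeOptions(
  defaults: NodeOptionsLike,
  perNode: NodeOptionsLike,
  isErrorHandlerNode = false,
): NodeOptionsLike {
  const applicable: NodeOptionsLike = { ...defaults };
  if (isErrorHandlerNode) {
    // Handlers must never catch themselves, and caching their results is unsafe.
    delete applicable.errorHandler;
    delete applicable.cachePolicy;
  }
  // Values passed directly to the node always override defaults.
  return { ...applicable, ...perNode };
}

const graphDefaults: NodeOptionsLike = {
  retryPolicy: { maxAttempts: 3 },
  timeout: 30000,
  errorHandler: "recover",
  cachePolicy: { ttl: 60 },
};
const regularNode = resolveNodeOptions(graphDefaults, { timeout: 5000 });
const handlerNode = resolveNodeOptions(graphDefaults, {}, true);
```

Here regularNode keeps the default retry policy and handler but uses its own 5000 ms timeout, while handlerNode inherits only the retry policy and timeout.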
Scope
Defaults set on a parent graph are not inherited by subgraphs. Each graph maintains its own defaults.
Functional API
The timeout option is available on task and entrypoint; task also accepts a retry option (not retryPolicy).
Timeout behavior matches addNode: NodeTimeoutError is raised on timeout, buffered writes are cleared, and the retry policy decides whether to retry. Error handlers are not available on task / entrypoint in the JavaScript/TypeScript SDK. Use StateGraph.addNode(..., { errorHandler }) instead.
Graceful shutdown
Cooperative shutdown lets you stop an in-flight graph run after the current superstep completes and save a resumable checkpoint. This is useful for handling SIGTERM signals or any external supervisor that needs to reclaim resources without losing work.
Requires @langchain/langgraph>=1.4.0.
Create a RunControl and pass it as control to invoke or stream. Call requestDrain() from any context to signal that the run should stop.
Semantics
Drain is cooperative and operates between supersteps, never preempting work that is already running:
| Scenario | Behavior |
|---|---|
| Node mid-execution | Runs to completion. Drain takes effect on the next superstep. |
| Node with a retry policy currently retrying | Retry loop runs to exhaustion or success. Drain takes effect after. |
| Graph finishes naturally on the same tick as drain | Returns normally. Inspect control.drainRequested to distinguish from a normal run. |
| More supersteps remain | Raises GraphDrained(reason). Checkpoint is saved and resumable. |
| Subgraph requests drain | GraphDrained bubbles up through the parent and stops it at its own next superstep boundary. |
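The boundary semantics in the table can be modelled with a toy superstep loop. RunControlLike and runSupersteps are hypothetical and exist only for this illustration. Only the names requestDrain and drainRequested come from the text, and the real library raises GraphDrained rather than returning a status.

```typescript
// Toy superstep loop showing where a drain request takes effect.
// The real library raises GraphDrained; this model returns a status instead.
class RunControlLike {
  drainRequested = false;
  requestDrain(): void {
    this.drainRequested = true;
  }
}

type RunOutcome =
  | { status: "finished"; completed: number }
  | { status: "drained"; completed: number }; // resumable from `completed`

function runSupersteps(
  steps: Array<(control: RunControlLike) => void>,
  control: RunControlLike,
  startAt = 0,
): RunOutcome {
  let completed = startAt;
  for (const step of steps.slice(startAt)) {
    // A running step is never preempted; it always runs to completion.
    step(control);
    completed += 1;
    // Drain is only checked at the boundary, and only if more work remains.
    if (control.drainRequested && completed < steps.length) {
      return { status: "drained", completed };
    }
  }
  return { status: "finished", completed };
}
```

A drain requested during the final step yields "finished", which is why the table suggests inspecting drainRequested to tell that case apart from an ordinary run.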
Resume after drain
Resume a drained run with invoke(null, config) using the same thread_id.
Read drain state inside a node
Access drain state through the runtime parameter to adjust node behavior before the superstep boundary is reached.
SIGTERM hook pattern
The recommended pattern for handling process shutdown is to call requestDrain() from a SIGTERM handler and let the current superstep finish. requestDrain() does not cancel in-flight async work. For a hard upper bound, pair drain with a graceful timeout and an AbortSignal.
Limitations
- setNodeDefaults is not inherited by subgraphs: each graph manages its own defaults independently.
- Error handlers are StateGraph-only: pass errorHandler to StateGraph.addNode, not the base Graph class. Error handlers are not available on task / entrypoint.
- One handler per node: each node can have at most one errorHandler.
- Handler failures bubble up: if the error handler itself throws, that exception propagates as if the node had no handler.

