When a node fails—from a slow external API, a transient network error, or an unhandled exception—LangGraph gives you three composable mechanisms to respond:
- Retries — automatically re-run failed attempts based on exception type and backoff settings
- Timeouts — cap how long a single attempt may run
- Error handling — run a recovery function after all retries are exhausted
When a node raises an exception (including NodeTimeoutError from a timeout), the retry policy decides whether to retry. Only after retries are exhausted does the error handler run.
For stopping a run cleanly at a superstep boundary and resuming later, see Graceful shutdown.
Per-node timeouts and node-level error handlers require langgraph>=1.2, currently in alpha.
Retries
A retry policy automatically re-runs a failed node attempt based on exception type and backoff settings. Pass retry_policy= to add_node:
Default behavior
By default, retry_on uses default_retry_on, which retries on any exception except the following (and their subclasses):
ValueError, TypeError, ArithmeticError, ImportError, LookupError, NameError, SyntaxError, RuntimeError, ReferenceError, StopIteration, StopAsyncIteration, OSError
For HTTP errors raised by requests and httpx, it retries only on 5xx status codes. NodeTimeoutError is retryable by default.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_attempts | int | 3 | Maximum number of attempts, including the first. |
| initial_interval | float | 0.5 | Seconds before the first retry. |
| backoff_factor | float | 2.0 | Multiplier applied to the interval after each retry. |
| max_interval | float | 128.0 | Maximum seconds between retries. |
| jitter | bool | True | Add random jitter to the interval. |
| retry_on | type[Exception] \| Sequence[type[Exception]] \| Callable[[Exception], bool] | default_retry_on | Exceptions to retry on, or a callable returning True for retryable exceptions. |
Custom retry logic
Pass a callable or exception type to retry_on. Import default_retry_on to extend the default behavior:
Inspect retry state
Use runtime.execution_info inside a node to inspect the current attempt number. This is useful for switching to a fallback when the primary call keeps failing:
execution_info exposes the following fields:
| Attribute | Type | Description |
|---|---|---|
| node_attempt | int | Current attempt number (1-indexed): 1 on the first try, 2 on the first retry, and so on. |
| node_first_attempt_time | float \| None | Unix timestamp of when the first attempt started. Constant across retries. |
| thread_id | str \| None | Thread ID for the current execution. None without a checkpointer. |
| run_id | str \| None | Run ID for the current execution. None when not provided in config. |
| checkpoint_id | str | Checkpoint ID for the current execution. |
| task_id | str | Task ID for the current execution. |
execution_info is available even without a retry policy—node_attempt defaults to 1.
Timeouts
Requires langgraph>=1.2, currently in alpha.
The timeout= parameter on add_node caps how long a single node attempt may run. Pass a number (seconds), a timedelta, or a TimeoutPolicy for separate run and idle limits:
Run timeout
run_timeout is a hard wall-clock cap on a single attempt. It is never refreshed, regardless of node activity. When it fires, LangGraph raises NodeTimeoutError, clears any writes from the failed attempt, and lets the retry policy decide whether to retry.
Idle timeout
idle_timeout is a progress-resetting cap. It fires only when the node stops making observable progress for the specified duration—unlike run_timeout, the clock resets whenever the node produces a progress signal:
You can combine run_timeout and idle_timeout; whichever fires first cancels the attempt.
Progress signals
Under the default refresh_on="auto", the idle clock resets on any of the following:
- State writes via CONFIG_KEY_SEND
- Stream output (yielded async stream chunks)
- Child-task scheduling
- Runtime stream-writer calls
- Any LangChain callback event from the node or its descendants (LLM tokens, tool calls, chain start/end, etc.)
Heartbeat mode
Set refresh_on="heartbeat" to narrow the refresh source to explicit runtime.heartbeat() calls only. This is useful when you want a strict idle definition that isn’t reset by chatty descendants:
Manual heartbeats
For long-running async work that doesn’t naturally emit progress signals, call runtime.heartbeat() to manually reset the idle clock:
runtime.heartbeat() is a no-op outside an idle-timed attempt, so you can call it unconditionally.
NodeTimeoutError
When a timeout fires, LangGraph raises NodeTimeoutError with structured context about which limit was hit:
| Attribute | Type | Description |
|---|---|---|
| node | str | Name of the node whose execution timed out. |
| elapsed | float | Seconds elapsed before the timeout fired. |
| kind | Literal["idle", "run"] | Which timeout fired. |
| idle_timeout | float \| None | The configured idle timeout (seconds), if any. |
| run_timeout | float \| None | The configured run timeout (seconds), if any. |
NodeTimeoutError is retryable by default. Combining timeout= with retry_policy= works out of the box—the timeout clock resets on each new attempt, and writes from a timed-out attempt are cleared before the next retry:
Dynamic timeouts with Send
When using Send to dispatch nodes dynamically (for example, in map-reduce patterns), you can pass a timeout= directly on the Send to override the target node’s static timeout for that specific push:
If timeout= is omitted on the Send, the target node’s timeout (set at add_node time) applies. This lets you set a default timeout on the node and tighten it for individual calls.
Error handling
Requires langgraph>=1.2, currently in alpha.
An error handler runs a recovery function after retries are exhausted; it can return a state update or a Command. This is useful for compensation flows (Saga patterns) where you want to recover gracefully rather than abort the entire graph.
Pass error_handler= to add_node:
The handler runs after the retry_policy is exhausted, or immediately if no retry policy is configured. The retry policy and the error handler stay decoupled: configure when to retry and when to compensate independently.
NodeError
Error handlers receive failure context through a typed error: NodeError parameter, injected by type annotation (the same pattern as runtime: Runtime):
NodeError is a frozen dataclass with two fields:
| Attribute | Type | Description |
|---|---|---|
| node | str | Name of the node whose execution failed. |
| error | BaseException | The exception raised by the failed node. |
The error: NodeError parameter is opt-in. Handlers that don’t need failure context can use simpler signatures like (state) or (state, runtime).
Route with Command
Error handlers can return a Command to update state and route to a specific node, enabling Saga / compensation patterns:
charge_payment retries on ConnectionError up to 3 times. If retries are exhausted (or the error isn’t a ConnectionError), the handler compensates by updating state and routing to finalize instead of aborting the graph.
Resume-safe failures
Failure provenance is checkpointed. If the graph is interrupted or the process crashes after a node fails but before the handler completes, the handler sees the same NodeError context when the graph resumes from its checkpoint.
Behavior with interrupt()
Subgraph failures
If a node wraps a subgraph and the subgraph raises an unhandled exception, that exception surfaces to the parent node. If the parent node has an error_handler, the handler fires with the subgraph’s exception in error.error.
Functional API
The same timeout= and retry_policy= parameters are available on @task and @entrypoint in the functional API:
The semantics match add_node: NodeTimeoutError is raised on timeout, buffered writes are cleared, and the retry policy decides whether to retry.
Limitations
- Python only: timeouts and error handlers are not available in the JavaScript/TypeScript SDK. Retry policies work in both Python and TypeScript.
- Timeouts are async-only: sync nodes with a timeout are rejected at compile time.
- One handler per node: each node can have at most one error_handler.
- Handler failures bubble up: if the error handler itself raises, that exception propagates as if the node had no handler.