Software needs to change in production. New requirements, bug fixes, and refactors all eventually land in your graph code. Because LangGraph runs the latest deployed graph against state that has been persisted for existing threads, every change you ship is effectively a backward-compatible API change with respect to your existing checkpoints.

Unlike workflow engines that pin a run to the version of code it started with, LangGraph applies the latest graph immediately to every thread, both new threads and threads that resume from a checkpoint. This is convenient: bug fixes propagate to in-flight conversations and agents without ceremony. It also means you must reason about how each change interacts with runs that started under the previous version of the code.

There are three categories of compatibility issues to watch for, in roughly the order you will encounter them:
- Technical compatibility: The most common; the new code must still load and execute against existing State.
- Business compatibility: Less common; existing runs should keep following the old business logic even though the code has changed.
- Non-determinism: Only applies to the Functional API.
## Technical compatibility
Technical compatibility is the equivalent of an API breaking change in a microservice. The "API" here is the contract between your graph code and the data already persisted by the checkpointer for existing threads. When a thread resumes, LangGraph deserializes the saved state, dispatches it to a node by name, and expects the node to return values that fit the state schema. Common technical breakages:

- Renaming or removing a node while threads are paused at or about to enter that node, for example at an `interrupt` or via a checkpointed conditional edge that still routes to the old name. On resume, LangGraph cannot find the node by its saved name and the run fails. The starting point for resuming a run is the beginning of the node where execution stopped, so a missing node has nowhere to resume from.
- Renaming or removing a State key that older checkpoints still contain or that downstream nodes still read.
- Tightening a State field, such as making an `Optional` field required, narrowing a type, or adding a new required field with no default. Existing checkpoints will not satisfy the new schema.
## Recommended patterns
- Add new state fields as `NotRequired` (or `Optional[...] = None`) so old checkpoints still validate.
- Treat removals as deprecations. Keep the field defined on the state for at least one drain cycle, even if no node reads it, so existing checkpoints continue to load.
- Rename through add-then-remove. Add the new field or node alongside the old one, dual-write or route to both for a deprecation window, then remove the old one once you have confirmed no in-flight thread depends on it.
- Keep node functions tolerant of unknown keys. `TypedDict` ignores extra keys at runtime, so leftover state from an older code version will not raise unless a node explicitly reads a missing key.
- Use time travel and `graph.get_state` to spot-check existing threads against the new code in a staging deployment before rolling out.
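As a sketch of the first two patterns, assuming a `TypedDict` state (the `user_tier` field and `respond` node here are hypothetical names, purely for illustration):

```python
from typing_extensions import NotRequired, TypedDict

class State(TypedDict):
    messages: list               # existing field, present in old checkpoints
    # New field marked NotRequired: checkpoints written before this deploy
    # lack the key but still satisfy the schema when they load.
    user_tier: NotRequired[str]

def respond(state: State) -> dict:
    # Read the new field defensively so threads resumed from old
    # checkpoints (which have no "user_tier" key) get a sensible default.
    tier = state.get("user_tier", "free")
    return {"messages": state["messages"] + [f"handled as {tier} tier"]}
```

An old checkpoint like `{"messages": []}` still flows through `respond` unchanged, while new threads can carry `user_tier` from the start.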
## Detecting in-flight threads
Before you remove a node, rename a State key, or otherwise make a change that older threads cannot tolerate, you want to know whether any threads are currently parked on the version of the code you are about to drop. LangGraph itself does not maintain a search index over thread state, so the answer depends on where your graph runs.

**If you deploy to LangSmith.** Use the Agent Server's thread search to filter by status. The `status` field accepts `idle`, `busy`, `interrupted`, and `error`, so you can bulk-query for `interrupted` or `busy` threads, optionally narrowed with metadata filters. See Filter by thread status and List threads.
**Anywhere LangGraph runs.** Use LangSmith tracing to monitor which nodes are being entered and exited in production. This is the most reliable signal that a node or state field is no longer reachable in any active code path.
**When you already have a `thread_id`.** Inspect that single thread directly: `graph.get_state(config)` returns the latest checkpoint, including which node the thread is paused at and any pending interrupts, and `graph.get_state_history(config)` returns the full chronological list of checkpoints for the thread.
## Business compatibility
Sometimes a change is technically valid (every existing checkpoint still loads and every node still resolves), but the meaning of the new graph differs from the old one. The new behavior is correct for new threads, and you do not want to retroactively apply it to threads that started under the old logic. For example, suppose your graph runs `intake → triage → respond`, and you decide to insert a new `policy_check` step between `triage` and `respond`:
- Threads that have already passed `triage` should continue straight to `respond` (the old flow).
- New threads should run the full new flow.

The way to get both is a version flag in state: stamp each thread with a `flow_version` when it starts, and branch on it with a conditional edge. Old threads resuming after `triage` read `flow_version` from their saved state (or fall through to the v1 default) and skip `policy_check`. New threads start at `intake`, are stamped with `flow_version=2`, and run the new path. Once all v1 threads have completed, you can remove the version flag and the conditional edge.
This pattern only works if you set the version at thread start, before any branch that needs to be versioned. Setting it later means existing threads will not have it set when they need it.
## Non-determinism
This category only applies to the Functional API. The Graph API re-enters at the node boundary on resume, so node code is not "replayed" from the start of the function the way Temporal-style workflows are. The Functional API, in contrast, replays the body of an `@entrypoint` from the beginning when a run resumes, using cached `@task` results to skip work that has already been done. Two kinds of changes break this model:
- Adding, removing, or reordering `@task` calls or `interrupt` calls that come before the resume point. LangGraph matches cached results and resume values to calls by their position in the replay, so shifting that position can cause the wrong cached value to be replayed against a different call.
- Introducing non-deterministic operations outside of a `@task`, such as `time.time()`, `random.random()`, or a network call inlined in the entrypoint body. On replay these produce different values than they did on the first run, which can change the control flow.
If you need to change an `@entrypoint` that has in-flight runs, the safest options are:
- Let in-flight runs drain before deploying the change.
- Wrap any new logic in a new `@task` so its results are checkpointed independently.
- Register a new entrypoint under a new graph name in `langgraph.json` for the new behavior, and route new threads to it.