A four-agent order-remediation network crashed at 2:14 AM, mid-refund. The on-call engineer did exactly what the runbook said: re-invoke the graph against the persisted checkpoint. It resumed cleanly. It also issued the refund a second time, emailed the customer a second apology, and re-opened a ticket that a human had already closed forty minutes earlier.
Nothing was broken. The checkpoint was intact. Every field deserialized perfectly. The serializer, the Postgres backend, the thread isolation - all of it worked exactly as designed. The network restored a flawless, internally consistent snapshot of the world as it existed at 2:14 AM and then acted on it at 2:54 AM, as if the intervening forty minutes had not happened.
That is the failure almost nobody designs for. Teams spend their state-architecture effort making sure the checkpoint survives the crash. The checkpoint surviving the crash is the easy half. The hard half is that the checkpoint is a photograph, and your agents resume by treating the photograph as a live feed.
The thesis: persistence is solved, reconciliation is not
Here is the claim this article exists to defend: for a long-running agent network, durable state is a solved problem and stale state is the unsolved one - and checkpointing makes the stale-state problem worse, not better, because it restores the network's confidence in a worldview that has expired.
The consensus framing is that state architecture for agents is about persistence: pick a backend, serialize reliably, resume where you left off. That framing is incomplete in a way that ships double refunds. A crash-free system that never persisted anything is forced to re-observe the world on every step. A checkpointed system is granted permission to skip re-observation - and it takes that permission. The act of persisting state is also the act of caching a view of reality that has no expiry stamp on it.
Practitioners have started naming the symptom: the stale world model problem, in which an agent "acts on a description of a world that no longer exists." I want to name the architectural cause and the fix. I call the gap between the network's restored internal state and the current external world Resume Divergence. The longer the gap between checkpoint and resume, the wider the divergence, and the more dangerous the first action after resume becomes. The discipline that closes it - re-observing the world before trusting the snapshot - I call the Reconciliation Barrier. Most production agent stacks have neither concept and pay for the absence in incidents that look like bugs but are actually an architectural category error.
Why this matters more for networks than for single agents
A single long-running agent that resumes on stale state has a contained blast radius: it makes one wrong decision, you see it, you fix it. I covered the single-agent checkpoint-and-resume mechanics in State Management for Agentic Systems, and the failure there is mostly recoverable.
Networks change the math three ways.
First, side effects multiply across agents. In a network, the refund agent issues money, the notification agent sends mail, the ticketing agent mutates external systems. Each is a non-idempotent write to a world outside the checkpoint. A stale resume does not produce one wrong answer - it produces a fan-out of irreversible external actions.
Second, state is shared, so staleness propagates. When the intake agent writes account_balance_observed into shared state and three downstream agents read it, one stale observation poisons every decision derived from it. This is the same shared-mutable-state hazard that Multi-Agent Topology Patterns calls a tear point - except a resume turns a latent tear into a guaranteed one.
Third, the checkpoint is a local snapshot, not a global one. LangGraph checkpoints the graph's channel values at a superstep boundary. In a network with external actors - other services, humans, other agent graphs - that snapshot is a local cut through a distributed system. Chandy and Lamport established in 1985 that a globally consistent snapshot requires coordinated markers across all participants; a single graph's checkpoint has no markers reaching the payments API or the human who closed the ticket. You are recording one process's view and calling it the state of the world.
The wrong way: trust the channels you restored
Here is a shared-state network that looks correct, passes every test where nothing crashes, and is a loaded gun on resume. It uses current LangGraph (1.2.x) and the production-recommended Postgres checkpointer.
```python
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.checkpoint.postgres import PostgresSaver

import payments, crm, ticketing  # external side-effectful systems


class NetworkState(TypedDict):
    order_id: str
    issue: str
    refund_approved: bool
    refund_issued: bool
    account_balance_observed: float  # read from billing at intake
    customer_email_observed: str     # read from CRM at intake
    ticket_open: bool                # read from ticketing at intake
    messages: Annotated[list, add_messages]


def intake(state: NetworkState) -> dict:
    acct = crm.lookup(state["order_id"])
    return {
        "account_balance_observed": acct.balance,
        "customer_email_observed": acct.email,
        "ticket_open": ticketing.is_open(state["order_id"]),
    }


def refund_agent(state: NetworkState) -> dict:
    if state["refund_approved"] and not state["refund_issued"]:
        # non-idempotent external write
        payments.issue_refund(state["order_id"], state["account_balance_observed"])
        return {"refund_issued": True}
    return {}


def notify_agent(state: NetworkState) -> dict:
    if state["ticket_open"]:
        crm.send_email(state["customer_email_observed"], "Your refund is processed")
    return {}


builder = StateGraph(NetworkState)
builder.add_node("intake", intake)
builder.add_node("refund_agent", refund_agent)
builder.add_node("notify_agent", notify_agent)
builder.add_edge(START, "intake")
builder.add_edge("intake", "refund_agent")
builder.add_edge("refund_agent", "notify_agent")
builder.add_edge("notify_agent", END)

with PostgresSaver.from_conn_string("postgresql://...") as cp:
    cp.setup()
    graph = builder.compile(checkpointer=cp)
    config = {"configurable": {"thread_id": "order-8841"}}
    graph.invoke(
        {
            "order_id": "order-8841",
            "issue": "double charge",
            "refund_approved": True,
            "refund_issued": False,
            "messages": [],
        },
        config,
    )
```

When this crashes inside refund_agent - after payments.issue_refund has hit the payments API but before the checkpoint persisting refund_issued: True is committed - the last durable checkpoint still says refund_issued: False. The standard resume is one line:

```python
graph.invoke(None, config)  # resume from last checkpoint
```

Resume re-runs the node. refund_issued is still False in the restored state, so the guard passes, and payments.issue_refund fires again. This is not a LangGraph defect. Resume re-executes the node; it does not continue from the line after the crash. Any LLM call, API request, or money movement inside that node runs again. The official durable-execution guidance is explicit that nodes after the last checkpoint re-run on resume - which is exactly why idempotency is non-negotiable for side-effectful nodes.
And even if the node were idempotent, account_balance_observed, customer_email_observed, and ticket_open were captured at intake. By resume time the balance may have changed, the customer may have updated their email, and a human may have closed the ticket. notify_agent cheerfully emails a "refund processed" message about a ticket that a person resolved manually forty minutes ago. The state is internally perfect and externally a lie. That is Resume Divergence in one screen of code.
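The replay half of that failure can be reproduced without LangGraph at all. The following is a minimal sketch, not the framework's actual resume machinery: `ledger`, `Crash`, `refund_node`, and `run_step` are hypothetical stand-ins, and the only assumption carried over from the text is that a node's update is committed after the node returns and that resume re-runs the whole node.

```python
# Hypothetical stand-in for the payments API: records every refund it is asked to issue.
ledger: list[str] = []


class Crash(Exception):
    """Simulates the process dying between the side effect and the checkpoint write."""


def refund_node(state: dict, crash_after_side_effect: bool = False) -> dict:
    if state["refund_approved"] and not state["refund_issued"]:
        ledger.append(state["order_id"])  # non-idempotent external write
        if crash_after_side_effect:
            raise Crash("died before the checkpoint committed")
        return {"refund_issued": True}
    return {}


def run_step(checkpoint: dict, **kwargs) -> dict:
    """Run the node against a copy of the checkpoint; commit only if it returns."""
    update = refund_node(dict(checkpoint), **kwargs)
    return {**checkpoint, **update}


checkpoint = {"order_id": "order-8841", "refund_approved": True, "refund_issued": False}

try:
    checkpoint = run_step(checkpoint, crash_after_side_effect=True)
except Crash:
    pass  # the checkpoint is untouched: refund_issued is still False

checkpoint = run_step(checkpoint)  # "resume": the whole node runs again

print(ledger)  # ['order-8841', 'order-8841']
```

The checkpoint ends up correct (refund_issued is True) while the ledger shows two refunds, which is the point: the state object never records that anything went wrong.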
The right way: classify state by perishability, then gate it
The fix is not a better backend. It is recognizing that the fields in your state object do not all age the same way, and architecting around that.
Some channels record decisions and derivations the network itself produced: order_id, issue, refund_approved. These are true the moment they are written and stay true forever. A checkpoint is the authoritative source for them. Call this settled state.
Other channels record observations of an external world the network does not control: account_balance_observed, customer_email_observed, ticket_open, and the external truth of whether the refund actually went through. These are true only at the instant of observation and rot continuously after. A checkpoint is a cache of them, never the source. Call this perishable state.
The architectural rule follows directly: settled state is trusted across a resume boundary; perishable state is not. Before any agent acts on a perishable channel after a resume, that channel must be re-observed from its system of record and reconciled against what the checkpoint claims. That gate is the Reconciliation Barrier. (It is not LangGraph's parallel-branch sync barrier, which only waits for fan-in branches to finish - the Reconciliation Barrier re-derives state against the live world before execution continues.)
Here is the same network rebuilt around it.
```python
from typing import Annotated, TypedDict
from datetime import datetime, timezone

from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.checkpoint.postgres import PostgresSaver

import payments, crm, ticketing

SCHEMA_VERSION = 3


def now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()


class NetworkState(TypedDict):
    schema_version: int
    order_id: str          # settled
    issue: str             # settled
    refund_approved: bool  # settled
    # perishable channels - never trusted across a resume:
    account_balance_observed: float
    customer_email_observed: str
    ticket_open: bool
    refund_confirmed: bool  # external truth, must be re-verified
    fresh_as_of: str        # ISO timestamp of last observation
    messages: Annotated[list, add_messages]


def migrate(state: NetworkState) -> dict:
    """Bring an old checkpoint's schema up to current.

    LangGraph has no first-class migration, so the settled fields are ours
    to backfill. Returns a partial-state update; the graph persists what
    we return.
    """
    v = state.get("schema_version", 1)
    updates: dict = {}
    if v < 2:
        # v2 renamed the settled field `issue_text` -> `issue`
        updates["issue"] = state.get("issue") or state.get("issue_text", "")
        v = 2
    if v < 3:
        # v3 added the settled `refund_approved` gate; runs that predate
        # it were auto-approved, so preserve their behavior
        updates["refund_approved"] = state.get("refund_approved", True)
        v = 3
    updates["schema_version"] = SCHEMA_VERSION
    return updates


def reconciliation_barrier(state: NetworkState) -> dict:
    """Runs first on every entry.

    Migrates schema, then re-observes perishable state before any agent
    is allowed to act.
    """
    updates = migrate(state)  # settled fields: backfilled once
    acct = crm.lookup(state["order_id"])
    # System of record is authoritative for whether money already moved:
    refund_confirmed = payments.refund_exists(
        idempotency_key=f"{state['order_id']}:refund"
    )
    updates.update(
        account_balance_observed=acct.balance,
        customer_email_observed=acct.email,
        ticket_open=ticketing.is_open(state["order_id"]),
        refund_confirmed=refund_confirmed,  # perishable: re-derived every entry
        fresh_as_of=now_iso(),
    )
    return updates


def refund_agent(state: NetworkState) -> dict:
    key = f"{state['order_id']}:refund"
    if state["refund_approved"] and not state["refund_confirmed"]:
        # idempotency key makes a re-run a no-op at the API, not a double refund
        payments.issue_refund(
            state["order_id"],
            state["account_balance_observed"],
            idempotency_key=key,
        )
        # re-verify against the system of record so notify_agent sees a
        # refund issued in this same run, not only one found by the barrier
        return {"refund_confirmed": payments.refund_exists(idempotency_key=key)}
    return {}


def notify_agent(state: NetworkState) -> dict:
    # acts only on freshly reconciled perishable state
    if state["ticket_open"] and state["refund_confirmed"]:
        crm.send_email(state["customer_email_observed"], "Your refund is processed")
    return {}


builder = StateGraph(NetworkState)
builder.add_node("reconciliation_barrier", reconciliation_barrier)
builder.add_node("refund_agent", refund_agent)
builder.add_node("notify_agent", notify_agent)
builder.add_edge(START, "reconciliation_barrier")
builder.add_edge("reconciliation_barrier", "refund_agent")
builder.add_edge("refund_agent", "notify_agent")
builder.add_edge("notify_agent", END)
```

Three things changed, and all three matter.
The barrier runs first, always. It does not check whether this is a resume - it re-observes unconditionally. Trying to detect "are we resuming?" is a trap; the cheaper and safer invariant is that perishable channels are always re-derived at entry, so a cold start and a resume converge on the same fresh observation.
Side effects are keyed for idempotency and verified against the system of record. refund_agent no longer trusts a boolean in its own state to decide whether money moved. It asks the payments system, and it issues with an idempotency key so a re-execution collapses to a no-op at the API boundary. This is the agent-network version of Flink's end-to-end exactly-once model: you do not prevent replay, you make replay safe by externalizing the dedup to a transactional system of record.
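To make the "replay is safe, not prevented" idea concrete, here is a minimal sketch of what the payments side has to do for the idempotency key to mean anything. `FakePayments` is a hypothetical in-memory stand-in, not a real payments SDK; its method names simply mirror the `refund_exists` and `issue_refund` calls used in the network above.

```python
class FakePayments:
    """Hypothetical in-memory payments API that deduplicates on idempotency key."""

    def __init__(self) -> None:
        self._refunds: dict[str, float] = {}  # idempotency_key -> amount
        self.money_moved = 0.0

    def refund_exists(self, idempotency_key: str) -> bool:
        return idempotency_key in self._refunds

    def issue_refund(self, order_id: str, amount: float, idempotency_key: str) -> bool:
        """Return True if money moved, False if this key was already processed."""
        if idempotency_key in self._refunds:
            return False  # replayed request collapses to a no-op
        self._refunds[idempotency_key] = amount
        self.money_moved += amount
        return True


payments = FakePayments()
key = "order-8841:refund"

first = payments.issue_refund("order-8841", 42.0, idempotency_key=key)
replay = payments.issue_refund("order-8841", 42.0, idempotency_key=key)

print(first, replay, payments.money_moved)  # True False 42.0
```

The dedup lives in the system that owns the money, so it holds even when the caller's own state is stale or wrong.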
Schema migration is explicit, because LangGraph gives you nothing here. As of June 2026 there is no documented first-class tooling to migrate a checkpoint written under one state schema into a graph compiled with a different one. The channel_versions field in a checkpoint is execution bookkeeping - it tracks which nodes have seen which channel updates - not schema evolution. Add a refund_approved gate in v3 and redeploy, and an in-flight checkpoint written under v2 resumes with that field simply absent; the refund agent reads a missing key and the failure is silent. Note that this is a problem for settled fields specifically - perishable fields get re-derived on every entry anyway, so they self-heal, while settled fields persist exactly as the old schema wrote them. The fix is to stamp schema_version into state and run an explicit migrate() step inside the barrier. You own this; the framework does not.
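One way to keep that migration maintainable as versions accumulate is a registry of single-step upgrades applied in order. This is a sketch under assumptions: it works on plain dicts rather than inside a graph node, the step functions `v1_to_v2` and `v2_to_v3` are hypothetical names, and the two schema changes are the ones described in the network code above.

```python
from typing import Callable

SCHEMA_VERSION = 3


def v1_to_v2(state: dict) -> dict:
    # v2 renamed the settled field `issue_text` -> `issue`
    out = {k: v for k, v in state.items() if k != "issue_text"}
    out["issue"] = state.get("issue") or state.get("issue_text", "")
    return out


def v2_to_v3(state: dict) -> dict:
    # v3 added `refund_approved`; runs that predate it were auto-approved
    return {**state, "refund_approved": state.get("refund_approved", True)}


# Each entry upgrades a state written at version N to version N + 1.
MIGRATIONS: dict[int, Callable[[dict], dict]] = {1: v1_to_v2, 2: v2_to_v3}


def migrate(state: dict) -> dict:
    version = state.get("schema_version", 1)  # unstamped checkpoints are v1
    if version > SCHEMA_VERSION:
        raise ValueError(f"checkpoint schema {version} is newer than this code")
    while version < SCHEMA_VERSION:
        state = MIGRATIONS[version](state)
        version += 1
    return {**state, "schema_version": SCHEMA_VERSION}


old = {"order_id": "order-8841", "issue_text": "double charge"}
new = migrate(old)
print(new)
```

Refusing a checkpoint that is newer than the running code is a design choice in this sketch: a rollback deploy should fail loudly rather than resume against fields it does not understand.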
Resume Divergence and the Reconciliation Barrier
Naming the two halves makes the architecture decision sharp. A resume that restores the checkpoint and proceeds directly to act is an Optimistic Resume - it bets the world is unchanged. A resume that re-observes and reconciles perishable state before acting is a Reconciled Resume. The terms are deliberate echoes of optimistic concurrency control: optimistic concurrency assumes no conflicting write happened and validates a version on commit; Reconciled Resume assumes the world may have moved and validates observations on entry.
This is the same axis other frameworks are starting to expose as resume modes - the OpenAI Agents SDK discussion of PATCH, OVERWRITE, and FILL_GAPS draws exactly the line between trusting the checkpoint wholesale and re-validating it. But those are graph-wide modes: you pick one policy for the entire run. That is the wrong granularity. Resume safety is a per-channel property, not a graph-wide setting. order_id should be trusted on resume and account_balance_observed should not, in the same graph, on the same resume. Perishability is the criterion that lets you make that call channel by channel instead of betting one blanket mode against your whole state object.
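Per-channel trust can be stated in a few lines. The sketch below is illustrative only: `SETTLED` and `split_restored` are hypothetical names, and the channel names are the ones from the example network. It partitions a restored checkpoint into the channels that may be trusted as-is and the ones that must be re-observed, with anything unclassified falling on the distrust side.

```python
# Channels the network itself produced; the checkpoint is authoritative for these.
SETTLED = frozenset({"schema_version", "order_id", "issue", "refund_approved"})


def split_restored(restored: dict) -> tuple[dict, set[str]]:
    """Return (channels trusted on resume, names of channels to re-observe)."""
    trusted = {k: v for k, v in restored.items() if k in SETTLED}
    to_reobserve = set(restored) - SETTLED  # unclassified means perishable
    return trusted, to_reobserve


restored = {
    "schema_version": 3,
    "order_id": "order-8841",
    "refund_approved": True,
    "account_balance_observed": 42.0,
    "ticket_open": True,
}

trusted, to_reobserve = split_restored(restored)
print(sorted(trusted))       # ['order_id', 'refund_approved', 'schema_version']
print(sorted(to_reobserve))  # ['account_balance_observed', 'ticket_open']
```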
```mermaid
flowchart TD
    Crash[Crash mid-run]:::err --> Restore[Restore last checkpoint<br/>settled + perishable channels]:::primary
    Restore --> Choice{Resume strategy}:::decision
    Choice -->|Optimistic Resume| ActStale[Act on restored<br/>perishable channels]:::warn
    ActStale --> Divergence[Resume Divergence:<br/>double refund, stale email,<br/>reopened ticket]:::err
    Choice -->|Reconciled Resume| Barrier[Reconciliation Barrier]:::secondary
    Barrier --> Migrate[Migrate schema<br/>to current version]:::secondary
    Migrate --> Reobserve[Re-observe perishable state<br/>from system of record]:::secondary
    Reobserve --> Verify[Verify side effects<br/>via idempotency key]:::secondary
    Verify --> ActFresh[Agents act on<br/>reconciled state]:::success
    classDef primary fill:#4A90E2,stroke:#3A7BC8,color:#FFFFFF;
    classDef secondary fill:#98D8C8,stroke:#6FB8A8,color:#2C2C2A;
    classDef decision fill:#7B68EE,stroke:#6858DE,color:#FFFFFF;
    classDef warn fill:#FFD93D,stroke:#E0BC2D,color:#2C2C2A;
    classDef success fill:#6BCF7F,stroke:#54B868,color:#2C2C2A;
    classDef err fill:#E74C3C,stroke:#C0392B,color:#FFFFFF;
```
The barrier is not free - it adds a round of external reads on every entry, including cold starts. That is the price of correctness, and it is far cheaper than the incident. Where re-observation is genuinely expensive, attach a freshness horizon: fresh_as_of records when perishable state was last observed, and a channel is trusted only if observed within its tolerance. Balance might tolerate sixty seconds; "ticket still open" might tolerate zero. Perishability is not binary, it is a half-life - and because the half-life belongs to the channel, you can encode it as one, in a registry the barrier consults:
```python
from datetime import datetime, timezone

# Half-life per channel, in seconds of tolerable staleness.
# Settled channels are not listed: the barrier never calls is_stale for them.
# Any other unlisted channel falls back to zero tolerance.
PERISHABILITY = {
    "account_balance_observed": 60,
    "customer_email_observed": 3600,
    "ticket_open": 0,  # zero tolerance - always re-observe
    "refund_confirmed": 0,
}


def is_stale(channel: str, fresh_as_of: str) -> bool:
    ttl = PERISHABILITY.get(channel, 0)  # unknown -> treat as perishable
    age = (datetime.now(timezone.utc) - datetime.fromisoformat(fresh_as_of)).total_seconds()
    return age > ttl
```

The barrier re-observes only the channels is_stale flags, so a resume two seconds after a crash skips the reads that are still fresh, while a resume two hours later refreshes everything. The default for an unregistered channel is zero - unknown means perishable, perishable means distrust.
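A barrier that consults the registry might look like the sketch below. It departs from the network code above in two labeled ways: it assumes a per-channel `observed_at` timestamp map rather than the single fresh_as_of field, and it takes `now` as a parameter so the behavior is deterministic. `refresh_stale` and the `observers` callables are hypothetical names standing in for reads against each system of record.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable

PERISHABILITY = {"account_balance_observed": 60, "ticket_open": 0}


def is_stale(channel: str, observed_at: str, now: datetime) -> bool:
    ttl = PERISHABILITY.get(channel, 0)  # unknown -> zero tolerance
    age = (now - datetime.fromisoformat(observed_at)).total_seconds()
    return age > ttl


def refresh_stale(
    observed_at: dict[str, str],
    observers: dict[str, Callable[[], object]],
    now: datetime,
) -> tuple[dict, list[str]]:
    """Re-observe only channels past their tolerance; return (updates, refreshed names)."""
    updates: dict = {}
    refreshed: list[str] = []
    for channel, observe in observers.items():
        stamp = observed_at.get(channel)
        if stamp is None or is_stale(channel, stamp, now):
            updates[channel] = observe()  # read from the system of record
            refreshed.append(channel)
    return updates, refreshed


now = datetime(2026, 1, 1, 2, 54, tzinfo=timezone.utc)
observers = {"account_balance_observed": lambda: 42.0, "ticket_open": lambda: False}

# Resume two seconds after the last observation: only the zero-tolerance channel is re-read.
recent = {name: (now - timedelta(seconds=2)).isoformat() for name in observers}
_, refreshed_soon = refresh_stale(recent, observers, now)

# Resume two hours later: everything is re-read.
old = {name: (now - timedelta(hours=2)).isoformat() for name in observers}
_, refreshed_late = refresh_stale(old, observers, now)

print(refreshed_soon)  # ['ticket_open']
print(refreshed_late)  # ['account_balance_observed', 'ticket_open']
```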
Shared versus isolated state: the real decision
Once you split channels by perishability, the shared-versus-isolated question answers most of itself.
Settled, network-produced decisions should be shared. refund_approved is a fact the whole network needs and that only one agent produces. Put it in the shared state schema and let every agent read it. This is the default LangGraph model - agents communicate over shared channels.
Perishable observations should be isolated and re-derived per agent. The anti-pattern is the intake agent observing the world once and writing it into shared state for everyone else to consume. That single observation is the staleness amplifier: it is captured at one instant and read at many later instants by agents that each assume it is current. Each agent that acts on the external world should re-observe the slice it needs, at the moment it needs it, inside its own scope.
LangGraph gives you the partitioning primitives to enforce this. Subgraphs carry their own checkpoint_ns (checkpoint namespace), so a subagent's private channels are isolated from the parent graph's state and from sibling subagents. Use a private subgraph schema for an agent's working observations, and only promote durable conclusions up to the shared parent state. The boundary between shared and private is not a style choice; it is the boundary between state that is safe to cache and state that must be re-observed.
There is a third tier the checkpointer does not cover at all. The checkpointer is thread-scoped - keyed by thread_id, it is short-term memory for one run of one task. The Store (BaseStore) is cross-thread, long-term memory namespaced by an application key like a user id. Durable facts that outlive a single task - a customer's standing preferences, prior resolutions - belong in the Store, not the checkpoint. Mixing the two is a common partitioning error: people stuff long-lived facts into thread state and then lose them the moment a new thread_id starts a new task.
Serialization is an attack surface, not a formatting detail
State architecture decisions are also security decisions, and this is where teams get hurt because the default behavior changed underneath them.
LangGraph's default serializer is JsonPlusSerializer, with msgpack as the primary encoding. Two checkpoint-serializer CVEs in the last year define the current posture. CVE-2025-64439 (fixed in langgraph-checkpoint 3.0.0) was a remote-code-execution path where the serializer fell back to a "json" mode on illegal Unicode surrogates and could be coerced into constructor-style deserialization of arbitrary Python. CVE-2026-27794 (fixed in langgraph-checkpoint 4.0.0, which flipped the default pickle_fallback from True to False) was an RCE through a pickle fallback - if an attacker could poison a shared cache or checkpoint store, a read turned into code execution.
The architectural takeaways are concrete. Run langgraph-checkpoint 4.x or newer so pickle_fallback=False is the default. Do not put objects in state that require the pickle fallback to serialize - if a channel will not round-trip through msgpack and JSON-native types cleanly, that is a design smell, not a reason to re-enable pickle. Set LANGGRAPH_STRICT_MSGPACK=true to constrain deserialization. And treat your checkpoint store with the same trust boundary as a database that executes code on read - because until very recently, that is exactly what it was. The blast radius is the entire network's state, shared across every agent.
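A cheap guard for the "design smell" rule is a check, run in tests or at the barrier, that flags any value in state that is not a plain JSON-native type. The sketch below is a hypothetical helper, not part of LangGraph, and it is deliberately stricter than what the serializer itself accepts: it flags anything beyond str, int, float, bool, None, list, and string-keyed dict.

```python
from datetime import datetime, timezone

JSON_SCALARS = (str, int, float, bool, type(None))


def non_native_paths(value, path: str = "state") -> list[str]:
    """Return the paths of values that are not plain JSON-native types."""
    if isinstance(value, JSON_SCALARS):
        return []
    if isinstance(value, list):
        bad: list[str] = []
        for i, item in enumerate(value):
            bad.extend(non_native_paths(item, f"{path}[{i}]"))
        return bad
    if isinstance(value, dict):
        bad = []
        for key, item in value.items():
            if not isinstance(key, str):
                bad.append(f"{path}[{key!r}] (non-string key)")
            bad.extend(non_native_paths(item, f"{path}[{key!r}]"))
        return bad
    return [f"{path} ({type(value).__name__})"]


state = {
    "order_id": "order-8841",
    "fresh_as_of": datetime(2026, 1, 1, tzinfo=timezone.utc),  # should be an ISO string
    "tags": {"vip"},                                           # should be a list
}

problems = non_native_paths(state)
print(problems)  # ["state['fresh_as_of'] (datetime)", "state['tags'] (set)"]
```

Storing the timestamp as an ISO string, as the network code above already does with fresh_as_of, is the kind of fix this check pushes toward.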
Checkpoint bloat: the state object is not a filing cabinet
Every superstep re-serializes the channels that changed, and in a chatty network the messages channel and any large payloads in state get written over and over. Drop a 50MB document into state and carry it across ten supersteps and you have written it to Postgres many times - TOAST bloat, write amplification, and a checkpoint history that becomes expensive to list and load. The pattern is to keep large or binary artifacts in object storage and put only a reference - a key, a URL, a hash - in state. The same logic applies to message history: an unbounded messages list is unbounded checkpoint growth. State is a coordination ledger, not a filing cabinet.
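The reference pattern is small enough to sketch. Here `BlobStore` is a hypothetical in-memory stand-in for real object storage, and `document_update` is an illustrative helper name; the only idea taken from the text is that state carries a key and a size, never the bytes.

```python
import hashlib


class BlobStore:
    """Hypothetical in-memory stand-in for object storage."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # Content-addressed key: writing the same payload twice stores it once.
        key = "sha256:" + hashlib.sha256(data).hexdigest()
        self._objects[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]


def document_update(data: bytes, store: BlobStore) -> dict:
    """Return the partial-state update for a large artifact: a pointer, not the payload."""
    return {"document_ref": store.put(data), "document_bytes": len(data)}


store = BlobStore()
payload = b"x" * 1_000_000

update = document_update(payload, store)
print(len(update["document_ref"]), update["document_bytes"])  # 71 1000000
```

However many supersteps carry `document_ref` forward, each checkpoint write costs 71 characters instead of the full payload.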
Concurrency: two resumers, one thread, no referee
One more network-specific hazard. LangGraph has no built-in coordination preventing two processes from resuming the same thread_id at the same time. Your watchdog re-invokes a stalled task at the same moment the original process recovers, and now two executions race on the same checkpoint thread, both issuing side effects. The checkpointer protects state consistency within a single writer; it does not arbitrate between two writers.
This is not LangGraph's job to solve and it will not solve it for you. Put a lease in front of resume: a Postgres advisory lock keyed on thread_id, or a short-TTL lock in Redis, acquired before invoke(None, config) and released at completion. If you cannot acquire the lease, someone else owns the resume - back off. The idempotency keys on side effects are your second line of defense for when the lease is missed, which is why you need both.
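The lease logic can be sketched independently of the backend. In the code below, `advisory_lock_key` derives the signed 64-bit integer that a Postgres advisory lock such as `pg_try_advisory_lock` takes as its key, and `LeaseTable` is a hypothetical in-memory stand-in for the lock backend so the example runs anywhere. The TTL semantics shown are an assumption modeled on a short-lived Redis-style lock; a session-level Postgres advisory lock is instead released when the holding connection ends.

```python
import hashlib
import time
from typing import Optional


def advisory_lock_key(thread_id: str) -> int:
    """Map a thread_id to a signed 64-bit integer usable as an advisory lock key."""
    digest = hashlib.sha256(thread_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class LeaseTable:
    """Hypothetical in-memory stand-in for the lock backend."""

    def __init__(self) -> None:
        self._leases: dict[int, tuple[str, float]] = {}  # key -> (owner, expires_at)

    def try_acquire(self, key: int, owner: str, ttl_s: float,
                    now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        holder = self._leases.get(key)
        if holder is not None and holder[0] != owner and holder[1] > now:
            return False  # someone else owns the resume: back off
        self._leases[key] = (owner, now + ttl_s)
        return True

    def release(self, key: int, owner: str) -> None:
        holder = self._leases.get(key)
        if holder is not None and holder[0] == owner:
            del self._leases[key]


leases = LeaseTable()
key = advisory_lock_key("order-8841")

watchdog_won = leases.try_acquire(key, "watchdog", ttl_s=30, now=0.0)
original_won = leases.try_acquire(key, "original-process", ttl_s=30, now=1.0)
after_expiry = leases.try_acquire(key, "original-process", ttl_s=30, now=31.0)

print(watchdog_won, original_won, after_expiry)  # True False True
```

The third call succeeding is exactly why the lease alone is not enough: a holder that stalls past its TTL can still be mid-side-effect when the next resumer takes over, and only the idempotency key catches that.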
When checkpointing is the wrong tool entirely
Be honest about the boundary. LangGraph checkpointing protects against application-level failures - a bad branch, a tool error, a human-in-the-loop pause - and it does it well. It does not give you infrastructure-level durable execution: there is no failure detection, no automatic resumption, no concurrency coordination, and replay re-runs side effects. If your requirement is "the network must survive process crashes and partitions and resume itself with exactly-once side effects automatically," that is a durable-execution engine's job - Temporal, DBOS, or Restate - with LangGraph running the agent logic on top. The mistake is asking the checkpointer to be a durable-execution runtime. It is a redo log, in ARIES terms; it has no undo or compensation half, and it does not detect that the writer died.
Decision guide: architecting state for an agent network
Work through these before you ship a long-running network.
- Classify every channel as settled or perishable. Settled = the network produced it and it stays true. Perishable = it observes an external world. If you cannot classify a channel, it is perishable - default to distrust.
- Put a Reconciliation Barrier at every entry. Re-observe perishable state and migrate schema before any agent acts. Do not branch on "is this a resume" - re-observe unconditionally.
- Make every external side effect idempotent and verified. Use an idempotency key and check the system of record before acting. Never decide "did I already do this" from a boolean in your own state.
- Share settled decisions, isolate perishable observations. Promote conclusions to shared state; keep working observations in subgraph-private channels with their own checkpoint_ns. Re-observe per agent, per moment.
- Stamp schema_version into state and own the migration. LangGraph ships no documented schema-migration tooling as of June 2026. A renamed settled field is a silent corruption waiting for a resume.
- Run langgraph-checkpoint 4.x+, keep pickle_fallback=False, set LANGGRAPH_STRICT_MSGPACK=true. Treat the checkpoint store as a code-execution trust boundary. Keep un-serializable objects out of state.
- Reference large payloads, do not store them. Object storage holds the artifact; state holds the pointer. Cap messages growth.
- Lease the thread before resuming. Advisory lock on thread_id so two processes never resume the same thread concurrently.
- Use the Store for cross-task facts, the checkpointer for within-task state. Do not lose long-lived facts to a new thread_id.
- If you need crash-survival and exactly-once automatically, reach for durable execution. LangGraph on top of Temporal/DBOS/Restate, not LangGraph pretending to be one.
The photograph is not the live feed. Persisting state was never the hard part - persisting your network's confidence in a snapshot, and then acting on that confidence after the world moved, is what ships the double refund. Classify every channel by perishability, gate the perishable ones behind a Reconciliation Barrier, and treat resume safety as a per-channel property rather than a backend you bought. Architect the resume, not just the checkpoint.
References
- Chandy, K.M. & Lamport, L. (1985). Distributed Snapshots: Determining Global States of Distributed Systems. ACM Transactions on Computer Systems 3(1). https://lamport.azurewebsites.net/pubs/chandy.pdf
- Mohan, C., Haderle, D., Lindsay, B., Pirahesh, H., Schwarz, P. (1992). ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging. ACM TODS 17(1). https://dl.acm.org/doi/10.1145/128765.128770
- Apache Flink. An Overview of End-to-End Exactly-Once Processing in Apache Flink with Apache Kafka. https://flink.apache.org/2018/02/28/an-overview-of-end-to-end-exactly-once-processing-in-apache-flink-with-apache-kafka-too/
- LangChain. LangGraph 1.0 is Now Generally Available (2025-10-22). https://changelog.langchain.com/announcements/langgraph-1-0-is-now-generally-available
- LangChain Docs. Persistence. https://docs.langchain.com/oss/python/langgraph/persistence
- LangChain Docs. Durable Execution. https://docs.langchain.com/oss/python/langgraph/durable-execution
- LangChain Reference. Durability modes. https://reference.langchain.com/python/langgraph/types/Durability
- DeepWiki. LangGraph Checkpointing Architecture. https://deepwiki.com/langchain-ai/langgraph/4.1-checkpointing
- GitHub Advisory. CVE-2025-64439 - JsonPlusSerializer json-mode RCE (GHSA-wwqv-p2pp-99h5) (2025-11-05). https://github.com/advisories/GHSA-wwqv-p2pp-99h5
- CVE-2026-27794 - BaseCache pickle-fallback RCE (2026-02-26). https://cvereports.com/reports/CVE-2026-27794
- Diagrid. Checkpoints Are Not Durable Execution. https://www.diagrid.io/blog/checkpoints-are-not-durable-execution-why-langgraph-crewai-google-adk-and-others-fall-short-for-production-agent-workflows
- Pan, T. (2026-04-10). The Stale World Model Problem in Long-Running Agents. https://tianpan.co/blog/2026-04-10-stale-world-model-long-running-agents
- OpenAI Agents Python. Agent State Checkpointing and Resumption (PATCH / OVERWRITE / FILL_GAPS resume modes), Issue #2172. https://github.com/openai/openai-agents-python/issues/2172
- Microsoft Azure Architecture Center. Event Sourcing pattern. https://learn.microsoft.com/en-us/azure/architecture/patterns/event-sourcing
- PyPI. langgraph-checkpoint. https://pypi.org/project/langgraph-checkpoint/