State Architecture for Agent Networks: The Resume Is the Dangerous Part

A four-agent order-remediation network crashed at 2:14 AM, mid-refund. The on-call engineer did exactly what the runbook said: re-invoke the graph against the persisted checkpoint. It resumed cleanly. It also issued the refund a second time, emailed the customer a second apology, and re-opened a ticket that a human had already closed forty minutes earlier.

Nothing was broken. The checkpoint was intact. Every field deserialized perfectly. The serializer, the Postgres backend, the thread isolation - all of it worked exactly as designed. The network restored a flawless, internally consistent snapshot of the world as it existed at 2:14 AM and then acted on it at 2:54 AM, as if the intervening forty minutes had not happened.

That is the failure almost nobody designs for. Teams spend their state-architecture effort making sure the checkpoint survives the crash. The checkpoint surviving the crash is the easy half. The hard half is that the checkpoint is a photograph, and your agents resume by treating the photograph as a live feed.

The thesis: persistence is solved, reconciliation is not

Here is the claim this article exists to defend: for a long-running agent network, durable state is a solved problem and stale state is the unsolved one - and checkpointing makes the stale-state problem worse, not better, because it restores the network's confidence in a worldview that has expired.

The consensus framing is that state architecture for agents is about persistence: pick a backend, serialize reliably, resume where you left off. That framing is incomplete in a way that ships double refunds. A crash-free system that never persisted anything is forced to re-observe the world on every step. A checkpointed system is granted permission to skip re-observation - and it takes that permission. The act of persisting state is also the act of caching a view of reality that has no expiry stamp on it. Nor is staleness the only thing a checkpointer leaves untouched: it makes the run's past durable, but it does nothing to constrain which actions are still safe to take next.

Practitioners have started naming the symptom - an agent that "acts on a description of a world that no longer exists" is the stale world model problem. I want to name the architectural cause and the fix. I call the gap between the network's restored internal state and the current external world Resume Divergence. The longer the gap between checkpoint and resume, the wider the divergence, and the more dangerous the first action after resume becomes. The discipline that closes it - re-observing the world before trusting the snapshot - I call the Reconciliation Barrier. Most production agent stacks have neither concept and pay for the absence in incidents that look like bugs but are actually an architectural category error.

Why this matters more for networks than for single agents

A single long-running agent that resumes on stale state has a contained blast radius: it makes one wrong decision, you see it, you fix it. I covered the single-agent checkpoint-and-resume mechanics in State Management for Agentic Systems, and the failure there is mostly recoverable.

Networks change the math three ways.

First, side effects multiply across agents. In a network, the refund agent issues money, the notification agent sends mail, the ticketing agent mutates external systems. Each is a non-idempotent write to a world outside the checkpoint. A stale resume does not produce one wrong answer - it produces a fan-out of irreversible external actions.

Second, state is shared, so staleness propagates. When the intake agent writes account_balance_observed into shared state and three downstream agents read it, one stale observation poisons every decision derived from it. This is the same shared-mutable-state hazard that Multi-Agent Topology Patterns calls a tear point - except a resume turns a latent tear into a guaranteed one.

Third, the checkpoint is a local snapshot, not a global one. LangGraph checkpoints the graph's channel values at a superstep boundary. In a network with external actors - other services, humans, other agent graphs - that snapshot is a local cut through a distributed system. Chandy and Lamport showed in 1985 that recording a globally consistent snapshot takes coordination: their algorithm propagates markers along every channel between participants. A single graph's checkpoint has no markers reaching the payments API or the human who closed the ticket. You are recording one process's view and calling it the state of the world.

The wrong way: trust the channels you restored

Here is a shared-state network that looks correct, passes every test where nothing crashes, and is a loaded gun on resume. It uses current LangGraph (1.2.x) and the production-recommended Postgres checkpointer.

code

from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.checkpoint.postgres import PostgresSaver

import payments, crm, ticketing  # external side-effectful systems


class NetworkState(TypedDict):
    order_id: str
    issue: str
    refund_approved: bool
    refund_issued: bool
    account_balance_observed: float      # read from billing at intake
    customer_email_observed: str         # read from CRM at intake
    ticket_open: bool                    # read from ticketing at intake
    messages: Annotated[list, add_messages]


def intake(state: NetworkState) -> dict:
    acct = crm.lookup(state["order_id"])
    return {
        "account_balance_observed": acct.balance,
        "customer_email_observed": acct.email,
        "ticket_open": ticketing.is_open(state["order_id"]),
    }


def refund_agent(state: NetworkState) -> dict:
    if state["refund_approved"] and not state["refund_issued"]:
        # non-idempotent external write
        payments.issue_refund(state["order_id"], state["account_balance_observed"])
        return {"refund_issued": True}
    return {}


def notify_agent(state: NetworkState) -> dict:
    if state["ticket_open"]:
        crm.send_email(state["customer_email_observed"], "Your refund is processed")
    return {}


builder = StateGraph(NetworkState)
builder.add_node("intake", intake)
builder.add_node("refund_agent", refund_agent)
builder.add_node("notify_agent", notify_agent)
builder.add_edge(START, "intake")
builder.add_edge("intake", "refund_agent")
builder.add_edge("refund_agent", "notify_agent")
builder.add_edge("notify_agent", END)

with PostgresSaver.from_conn_string("postgresql://...") as cp:
    cp.setup()
    graph = builder.compile(checkpointer=cp)
    config = {"configurable": {"thread_id": "order-8841"}}
    graph.invoke({"order_id": "order-8841", "issue": "double charge",
                  "refund_approved": True, "refund_issued": False,
                  "messages": []}, config)

When this crashes inside refund_agent - after payments.issue_refund has hit the payments API but before the checkpoint persisting refund_issued: True is committed - the last durable checkpoint still says refund_issued: False. The standard resume is one line:

code

graph.invoke(None, config)   # resume from last checkpoint

Resume re-runs the node. refund_issued is still False in the restored state, so the guard passes, and payments.issue_refund fires again. This is not a LangGraph defect. Resume re-executes the node; it does not continue from the line after the crash. Any LLM call, API request, or money movement inside that node runs again. The official durable-execution guidance is explicit that nodes after the last checkpoint re-run on resume - which is exactly why idempotency is non-negotiable for side-effectful nodes.

And even if the node were idempotent, account_balance_observed, customer_email_observed, and ticket_open were captured at intake. By resume time the balance may have changed, the customer may have updated their email, and a human may have closed the ticket. notify_agent cheerfully emails a "refund processed" message about a ticket that a person resolved manually forty minutes ago. The state is internally perfect and externally a lie. That is Resume Divergence in one screen of code.
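The double-fire does not depend on anything LangGraph-specific, which a toy model makes visible. The following is a minimal sketch, not LangGraph's implementation: FakePayments, Crash, run_node, and the crash_after_side_effect flag are hypothetical names invented here to stand in for the payments API, a dying process, and a superstep whose update is committed only when the node returns.

```python
class FakePayments:
    """Stand-in for an external payments API (hypothetical)."""

    def __init__(self):
        self.refunds = []  # every refund the outside world actually saw

    def issue_refund(self, order_id, amount):
        self.refunds.append((order_id, amount))


class Crash(Exception):
    """Simulates the process dying mid-node."""


def refund_node(state, payments, crash_after_side_effect=False):
    """Same guard as refund_agent above: trusts a boolean in its own state."""
    if state["refund_approved"] and not state["refund_issued"]:
        payments.issue_refund(state["order_id"], state["account_balance_observed"])
        if crash_after_side_effect:
            raise Crash("died after the API call, before the checkpoint commit")
        return {"refund_issued": True}
    return {}


def run_node(checkpoint, payments, **kwargs):
    """Toy superstep: the update is committed only if the node returns."""
    state = dict(checkpoint)                 # restore a copy of the last checkpoint
    update = refund_node(state, payments, **kwargs)
    checkpoint.update(update)                # the "commit" - never reached on a crash


payments = FakePayments()
checkpoint = {
    "order_id": "order-8841",
    "refund_approved": True,
    "refund_issued": False,
    "account_balance_observed": 42.0,
}

try:
    run_node(checkpoint, payments, crash_after_side_effect=True)
except Crash:
    pass  # the checkpoint still says refund_issued=False

run_node(checkpoint, payments)               # the one-line "resume"
print(len(payments.refunds), checkpoint["refund_issued"])
```

The payments stub ends up holding two refunds while the checkpoint records one, which is exactly the gap between internal state and external truth described above.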

The right way: classify state by perishability, then gate it

The fix is not a better backend. It is recognizing that the fields in your state object do not all age the same way, and architecting around that.

Some channels record decisions and derivations the network itself produced: order_id, issue, refund_approved. These are true the moment they are written and stay true forever. A checkpoint is the authoritative source for them. Call this settled state.

Other channels record observations of an external world the network does not control: account_balance_observed, customer_email_observed, ticket_open, and the external truth of whether the refund actually went through. These are true only at the instant of observation and rot continuously after. A checkpoint is a cache of them, never the source. Call this perishable state.

The architectural rule follows directly: settled state is trusted across a resume boundary; perishable state is not. Before any agent acts on a perishable channel after a resume, that channel must be re-observed from its system of record and reconciled against what the checkpoint claims. That gate is the Reconciliation Barrier. (It is not LangGraph's parallel-branch sync barrier, which only waits for fan-in branches to finish - the Reconciliation Barrier re-derives state against the live world before execution continues.)
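One way to make the rule mechanical is to partition a restored snapshot before any agent reads it. This is a sketch under assumptions rather than a LangGraph feature: SETTLED, split_on_resume, reconcile, and the observers mapping are names invented for illustration, and the lambdas stand in for real reads against billing and ticketing.

```python
# Channels the network itself produced; the checkpoint is authoritative for these.
SETTLED = {"order_id", "issue", "refund_approved"}


def split_on_resume(snapshot: dict) -> tuple[dict, set]:
    """Partition a restored snapshot into trusted values and channels
    that must be re-observed. Anything not listed as settled is treated
    as perishable, so an unclassified channel defaults to distrust."""
    trusted = {k: v for k, v in snapshot.items() if k in SETTLED}
    must_reobserve = set(snapshot) - SETTLED
    return trusted, must_reobserve


def reconcile(snapshot: dict, observers: dict) -> dict:
    """Rebuild state: settled values come from the checkpoint, perishable
    values from their system of record (one callable per channel).
    A perishable channel with no observer raises KeyError on purpose."""
    state, stale = split_on_resume(snapshot)
    for channel in stale:
        state[channel] = observers[channel](state["order_id"])
    return state


# What the checkpoint claimed at crash time.
restored = {
    "order_id": "order-8841",
    "issue": "double charge",
    "refund_approved": True,
    "account_balance_observed": 120.0,
    "ticket_open": True,
}

# Hypothetical live reads; the world moved while the run was down.
observers = {
    "account_balance_observed": lambda order_id: 95.5,
    "ticket_open": lambda order_id: False,
}

fresh = reconcile(restored, observers)
```

Failing loudly on a perishable channel that has no observer is a deliberate choice in this sketch: silently keeping the checkpointed value would reintroduce the optimistic behavior the barrier exists to prevent.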

Here is the same network rebuilt around it.

code

from typing import Annotated, TypedDict
from datetime import datetime, timezone

from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.checkpoint.postgres import PostgresSaver

import payments, crm, ticketing

SCHEMA_VERSION = 3


def now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()


class NetworkState(TypedDict):
    schema_version: int
    order_id: str                 # settled
    issue: str                    # settled
    refund_approved: bool         # settled
    # perishable channels - never trusted across a resume:
    account_balance_observed: float
    customer_email_observed: str
    ticket_open: bool
    refund_confirmed: bool        # external truth, must be re-verified
    fresh_as_of: str              # ISO timestamp of last observation
    messages: Annotated[list, add_messages]


def migrate(state: NetworkState) -> dict:
    """Bring an old checkpoint's schema up to current. LangGraph has no
    first-class migration, so the settled fields are ours to backfill.
    Returns a partial-state update; the graph persists what we return."""
    v = state.get("schema_version", 1)
    updates: dict = {}
    if v < 2:
        # v2 renamed the settled field `issue_text` -> `issue`
        updates["issue"] = state.get("issue") or state.get("issue_text", "")
        v = 2
    if v < 3:
        # v3 added the settled `refund_approved` gate; runs that predate
        # it were auto-approved, so preserve their behavior
        updates["refund_approved"] = state.get("refund_approved", True)
        v = 3
    updates["schema_version"] = SCHEMA_VERSION
    return updates


def reconciliation_barrier(state: NetworkState) -> dict:
    """Runs first on every entry. Migrates schema, then re-observes
    perishable state before any agent is allowed to act."""
    updates = migrate(state)  # settled fields: backfilled once
    acct = crm.lookup(state["order_id"])
    # System of record is authoritative for whether money already moved:
    refund_confirmed = payments.refund_exists(
        idempotency_key=f"{state['order_id']}:refund"
    )
    updates.update(
        account_balance_observed=acct.balance,
        customer_email_observed=acct.email,
        ticket_open=ticketing.is_open(state["order_id"]),
        refund_confirmed=refund_confirmed,   # perishable: re-derived every entry
        fresh_as_of=now_iso(),
    )
    return updates


def refund_agent(state: NetworkState) -> dict:
    key = f"{state['order_id']}:refund"
    if state["refund_approved"] and not state["refund_confirmed"]:
        # idempotency key makes a re-run a no-op at the API, not a double refund
        payments.issue_refund(
            state["order_id"],
            state["account_balance_observed"],
            idempotency_key=key,
        )
        # re-read the system of record so notify_agent sees the refund just issued
        return {"refund_confirmed": payments.refund_exists(idempotency_key=key)}
    return {}


def notify_agent(state: NetworkState) -> dict:
    # acts only on freshly reconciled perishable state
    if state["ticket_open"] and state["refund_confirmed"]:
        crm.send_email(state["customer_email_observed"], "Your refund is processed")
    return {}


builder = StateGraph(NetworkState)
builder.add_node("reconciliation_barrier", reconciliation_barrier)
builder.add_node("refund_agent", refund_agent)
builder.add_node("notify_agent", notify_agent)
builder.add_edge(START, "reconciliation_barrier")
builder.add_edge("reconciliation_barrier", "refund_agent")
builder.add_edge("refund_agent", "notify_agent")
builder.add_edge("notify_agent", END)

Three things changed, and all three matter.

The barrier runs first, always. It does not check whether this is a resume - it re-observes unconditionally. Trying to detect "are we resuming?" is a trap; the cheaper and safer invariant is that perishable channels are always re-derived at entry, so a cold start and a resume converge on the same fresh observation.

Side effects are keyed for idempotency and verified against the system of record. refund_agent no longer trusts a boolean in its own state to decide whether money moved. It asks the payments system, and it issues with an idempotency key so a re-execution collapses to a no-op at the API boundary. This is the agent-network version of Flink's end-to-end exactly-once model: you do not prevent replay, you make replay safe by externalizing the dedup to a transactional system of record. That system of record is also where an action's reversibility lives: the refund agent's power to undo a refund is borrowed from the payments system, not from the checkpoint layer beneath it, and it is not an intrinsic property of the agent.

Schema migration is explicit, because LangGraph gives you nothing here. As of June 2026 there is no documented first-class tooling to migrate a checkpoint written under one state schema into a graph compiled with a different one. The channel_versions and versions_seen fields in a checkpoint are execution bookkeeping - they track each channel's current version and which versions each node has already seen - not schema evolution. Add a refund_approved gate in v3 and redeploy, and an in-flight checkpoint written under v2 resumes with that field simply absent; the refund agent reads a missing key and the failure is silent. Note that this is a problem for settled fields specifically - perishable fields get re-derived on every entry anyway, so they self-heal, while settled fields persist exactly as the old schema wrote them. The fix is to stamp schema_version into state and run an explicit migrate() step inside the barrier. You own this; the framework does not.
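The idempotency half of these changes is easy to check in isolation. The sketch below assumes a payments API that deduplicates on an idempotency key; FakePayments, its methods, and the amount field are stand-ins invented for illustration, mirroring the issue_refund and refund_exists calls used above rather than any real provider's API.

```python
class FakePayments:
    """Hypothetical payments API that deduplicates on an idempotency key."""

    def __init__(self):
        self._by_key = {}   # idempotency_key -> refund record

    def issue_refund(self, order_id, amount, idempotency_key):
        # A repeated key returns the original record instead of paying again.
        if idempotency_key not in self._by_key:
            self._by_key[idempotency_key] = {"order_id": order_id, "amount": amount}
        return self._by_key[idempotency_key]

    def refund_exists(self, idempotency_key):
        return idempotency_key in self._by_key

    def total_refunded(self):
        return sum(record["amount"] for record in self._by_key.values())


def refund_node(state, payments):
    # The key is derived from settled state only, so it is stable across replays.
    key = f"{state['order_id']}:refund"
    if state["refund_approved"] and not payments.refund_exists(key):
        payments.issue_refund(state["order_id"], state["amount"], idempotency_key=key)
    # Report what the system of record says, not what the node believes it did.
    return {"refund_confirmed": payments.refund_exists(key)}


payments = FakePayments()
state = {"order_id": "order-8841", "refund_approved": True, "amount": 42.0}

first = refund_node(state, payments)
replay = refund_node(state, payments)      # simulated re-execution after a crash

# Even a direct duplicate call with the same key does not move money twice.
payments.issue_refund("order-8841", 42.0, idempotency_key="order-8841:refund")
```

The existence check and the key overlap on purpose: the check avoids a needless call in the common case, and the key protects the case where two executions both pass the check before either one writes.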

Resume Divergence and the Reconciliation Barrier

Naming the two halves makes the architecture decision sharp. A resume that restores the checkpoint and proceeds directly to act is an Optimistic Resume - it bets the world is unchanged. A resume that re-observes and reconciles perishable state before acting is a Reconciled Resume. The terms are deliberate echoes of optimistic concurrency control: optimistic concurrency assumes no conflicting write happened and validates a version on commit; Reconciled Resume assumes the world may have moved and validates observations on entry.

This is the same axis other frameworks are starting to expose as resume modes - the OpenAI Agents SDK discussion of PATCH, OVERWRITE, and FILL_GAPS draws exactly the line between trusting the checkpoint wholesale and re-validating it. But those are graph-wide modes: you pick one policy for the entire run. That is the wrong granularity. Resume safety is a per-channel property, not a graph-wide setting. order_id should be trusted on resume and account_balance_observed should not, in the same graph, on the same resume. Perishability is the criterion that lets you make that call channel by channel instead of betting one blanket mode against your whole state object.

mermaid

flowchart TD
    Crash[Crash mid-run]:::err --> Restore[Restore last checkpoint<br/>durable + perishable channels]:::primary
    Restore --> Choice{Resume strategy}:::decision

    Choice -->|Optimistic Resume| ActStale[Act on restored<br/>perishable channels]:::warn
    ActStale --> Divergence[Resume Divergence:<br/>double refund, stale email,<br/>reopened ticket]:::err

    Choice -->|Reconciled Resume| Barrier[Reconciliation Barrier]:::secondary
    Barrier --> Migrate[Migrate schema<br/>to current version]:::secondary
    Migrate --> Reobserve[Re-observe perishable state<br/>from system of record]:::secondary
    Reobserve --> Verify[Verify side effects<br/>via idempotency key]:::secondary
    Verify --> ActFresh[Agents act on<br/>reconciled state]:::success

    classDef primary fill:#4A90E2,stroke:#3A7BC8,color:#FFFFFF;
    classDef secondary fill:#98D8C8,stroke:#6FB8A8,color:#2C2C2A;
    classDef decision fill:#7B68EE,stroke:#6858DE,color:#FFFFFF;
    classDef warn fill:#FFD93D,stroke:#E0BC2D,color:#2C2C2A;
    classDef success fill:#6BCF7F,stroke:#54B868,color:#2C2C2A;
    classDef err fill:#E74C3C,stroke:#C0392B,color:#FFFFFF;

The barrier is not free - it adds a round of external reads on every entry, including cold starts. That is the price of correctness, and it is far cheaper than the incident. Where re-observation is genuinely expensive, attach a freshness horizon: fresh_as_of records when perishable state was last observed, and a channel is trusted only if observed within its tolerance. Balance might tolerate sixty seconds; "ticket still open" might tolerate zero. Perishability is not binary, it is a half-life - and because the half-life belongs to the channel, you can encode it as one, in a registry the barrier consults:

code

from datetime import datetime, timezone

# Half-life per channel, in seconds of tolerable staleness.
# Settled channels are not listed: the barrier never re-observes them.
# Any other unlisted channel falls back to zero tolerance in is_stale().
PERISHABILITY = {
    "account_balance_observed": 60,
    "customer_email_observed": 3600,
    "ticket_open": 0,          # zero tolerance - always re-observe
    "refund_confirmed": 0,
}


def is_stale(channel: str, fresh_as_of: str) -> bool:
    ttl = PERISHABILITY.get(channel, 0)            # unknown -> treat as perishable
    age = (datetime.now(timezone.utc)
           - datetime.fromisoformat(fresh_as_of)).total_seconds()
    return age > ttl

With the registry wired in, the barrier re-observes only the channels is_stale flags, so a resume two seconds after a crash skips the reads that are still fresh, while a resume two hours later refreshes everything. The default for an unregistered channel is zero - unknown means perishable, perishable means distrust.
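A barrier that honors per-channel tolerances needs one timestamp per channel rather than a single fresh_as_of for the whole state. The sketch below assumes that change. The observed_at mapping, selective_barrier, the injected now argument, and the lambda observers are all illustrative names, not part of LangGraph or of the code above.

```python
from datetime import datetime, timedelta, timezone

PERISHABILITY = {"account_balance_observed": 60, "ticket_open": 0}  # seconds


def is_stale(channel, observed_at, now):
    """True when the observation is older than the channel's tolerance.
    A missing timestamp or an unregistered channel counts as stale."""
    if observed_at is None:
        return True
    ttl = PERISHABILITY.get(channel, 0)
    return (now - datetime.fromisoformat(observed_at)).total_seconds() > ttl


def selective_barrier(state, observers, now=None):
    """Re-observe only the channels whose last observation has expired."""
    now = now or datetime.now(timezone.utc)
    stamps = dict(state.get("observed_at", {}))    # channel -> ISO timestamp
    updates = {}
    for channel, observe in observers.items():
        if is_stale(channel, stamps.get(channel), now):
            updates[channel] = observe(state["order_id"])
            stamps[channel] = now.isoformat()
    updates["observed_at"] = stamps
    return updates


# Fixed clock so the example is deterministic.
now = datetime(2026, 6, 1, 3, 0, tzinfo=timezone.utc)
two_seconds_ago = (now - timedelta(seconds=2)).isoformat()

state = {
    "order_id": "order-8841",
    "account_balance_observed": 120.0,
    "ticket_open": True,
    "observed_at": {
        "account_balance_observed": two_seconds_ago,
        "ticket_open": two_seconds_ago,
    },
}
observers = {
    "account_balance_observed": lambda order_id: 95.5,
    "ticket_open": lambda order_id: False,
}

soon = selective_barrier(state, observers, now=now)
later = selective_barrier(state, observers, now=now + timedelta(hours=2))
```

In the quick resume only ticket_open is re-read, because its tolerance is zero; two hours later both channels are refreshed. Passing the clock in as an argument is a testing convenience here, not a requirement of the technique.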

Shared versus isolated state: the real decision

Once you split channels by perishability, the shared-versus-isolated question answers most of itself.

Settled, network-produced decisions should be shared. refund_approved is a fact the whole network needs and that only one agent produces. Put it in the shared state schema and let every agent read it. This is the default LangGraph model - agents communicate over shared channels.

Perishable observations should be isolated and re-derived per agent. The anti-pattern is the intake agent observing the world once and writing it into shared state for everyone else to consume. That single observation is the staleness amplifier: it is captured at one instant and read at many later instants by agents that each assume it is current. Each agent that acts on the external world should re-observe the slice it needs, at the moment it needs it, inside its own scope.

LangGraph gives you the partitioning primitives to enforce this. Subgraphs carry their own checkpoint_ns (checkpoint namespace), so a subagent's private channels are isolated from the parent graph's state and from sibling subagents. Use a private subgraph schema for an agent's working observations, and only promote durable conclusions up to the shared parent state. The boundary between shared and private is not a style choice; it is the boundary between state that is safe to cache and state that must be re-observed.

There is a third tier the checkpointer does not cover at all. The checkpointer is thread-scoped - keyed by thread_id, it is short-term memory for one run of one task. The Store (BaseStore) is cross-thread, long-term memory namespaced by an application key like a user id. Durable facts that outlive a single task - a customer's standing preferences, prior resolutions - belong in the Store, not the checkpoint. Mixing the two is a common partitioning error: people stuff long-lived facts into thread state and then lose them the moment a new thread_id starts a new task.

Serialization is an attack surface, not a formatting detail

State architecture decisions are also security decisions, and this is where teams get hurt because the default behavior changed underneath them.

LangGraph's default serializer is JsonPlusSerializer, with msgpack as the primary encoding. Two checkpoint-serializer CVEs in the last year define the current posture. CVE-2025-64439 (fixed in langgraph-checkpoint 3.0.0) was a remote-code-execution path where the serializer fell back to a "json" mode on illegal Unicode surrogates and could be coerced into constructor-style deserialization of arbitrary Python. CVE-2026-27794 (fixed in checkpoint 4.0.0, which flipped the default pickle_fallback from True to False) was an RCE through a pickle fallback - if an attacker could poison a shared cache or checkpoint store, a read turned into code execution.

The architectural takeaways are concrete. Run langgraph-checkpoint 4.x or newer so pickle_fallback=False is the default. Do not put objects in state that require the pickle fallback to serialize - if a channel will not round-trip through msgpack and JSON-native types cleanly, that is a design smell, not a reason to re-enable pickle. Set LANGGRAPH_STRICT_MSGPACK=true to constrain deserialization. And treat your checkpoint store with the same trust boundary as a database that executes code on read - because until very recently, that is exactly what it was. The blast radius is the entire network's state, shared across every agent.
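One way to keep pickle-only objects out of state is a guard that runs in tests or at the edge of each node. The sketch below is an illustration under assumptions, not part of LangGraph: non_native_paths and assert_checkpoint_safe are hypothetical helpers, and the check is deliberately stricter than the serializer itself, which can encode more than JSON-native types.

```python
from datetime import datetime

JSON_NATIVE = (str, int, float, bool, type(None))


def non_native_paths(value, path="state"):
    """Return a description of every value that is not a JSON-native type."""
    if isinstance(value, JSON_NATIVE):
        return []
    if isinstance(value, dict):
        bad = []
        for key, item in value.items():
            if not isinstance(key, str):
                bad.append(f"{path}[{key!r}] (non-string key)")
            bad.extend(non_native_paths(item, f"{path}[{key!r}]"))
        return bad
    if isinstance(value, list):
        bad = []
        for index, item in enumerate(value):
            bad.extend(non_native_paths(item, f"{path}[{index}]"))
        return bad
    # Tuples, sets, datetimes, and custom classes all land here.
    return [f"{path} ({type(value).__name__})"]


def assert_checkpoint_safe(state):
    """Raise before a suspicious object can reach the checkpoint store."""
    problems = non_native_paths(state)
    if problems:
        raise TypeError("not JSON-native: " + ", ".join(problems))


clean = {"order_id": "order-8841", "amounts": [42.0, 3], "ticket_open": False}
dirty = {"order_id": "order-8841", "opened": datetime(2026, 6, 1)}

assert_checkpoint_safe(clean)              # passes silently
problems = non_native_paths(dirty)         # names the offending channel
```

A team that relies on the serializer's richer type support would loosen JSON_NATIVE accordingly; the point of the sketch is that the allowlist is explicit and failures name the channel.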

Checkpoint bloat: the state object is not a filing cabinet

Every superstep re-serializes the channels that changed, and in a chatty network the messages channel and any large payloads in state get written over and over. Drop a 50 MB document into a channel that changes on every superstep and carry it across ten supersteps, and you have written it to Postgres up to ten times - TOAST bloat, write amplification, and a checkpoint history that becomes expensive to list and load. The pattern is to keep large or binary artifacts in object storage and put only a reference - a key, a URL, a hash - in state. The same logic applies to message history: an unbounded messages list is unbounded checkpoint growth, and summarizing that list slows the growth without shrinking what is already on disk. State is a coordination ledger, not a filing cabinet.
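The reference-not-payload pattern can be sketched with a content-addressed store. Everything named here is an assumption for illustration: InMemoryObjectStore stands in for real object storage, and attach_document and the document_ref channel are hypothetical.

```python
import hashlib


class InMemoryObjectStore:
    """Stand-in for S3/GCS-style object storage (hypothetical)."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> dict:
        digest = hashlib.sha256(data).hexdigest()
        self._blobs[digest] = data          # content-addressed: same bytes, same key
        return {"sha256": digest, "size": len(data)}

    def get(self, ref: dict) -> bytes:
        data = self._blobs[ref["sha256"]]
        # Verify on read so a swapped or corrupted artifact is detected.
        if hashlib.sha256(data).hexdigest() != ref["sha256"]:
            raise ValueError("stored artifact does not match its reference")
        return data


def attach_document(store: InMemoryObjectStore, data: bytes) -> dict:
    """Return a state update that carries a pointer, not the payload."""
    return {"document_ref": store.put(data)}


store = InMemoryObjectStore()
document = b"x" * 1_000_000                 # a 1 MB artifact
update = attach_document(store, document)   # what actually goes into state
```

The update that reaches the checkpointer is a short dictionary no matter how large the artifact is, and because the key is a hash of the content, writing the same document twice does not store it twice.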

Concurrency: two resumers, one thread, no referee

One more network-specific hazard. LangGraph has no built-in coordination preventing two processes from resuming the same thread_id at the same time. Your watchdog re-invokes a stalled task at the same moment the original process recovers, and now two executions race on the same checkpoint thread, both issuing side effects. The checkpointer protects state consistency within a single writer; it does not arbitrate between two writers. Within a single run, concurrent updates to one state key are merged by that key's reducer; across two processes, nothing plays that role.

This is not LangGraph's job to solve and it will not solve it for you. Put a lease in front of resume: a Postgres advisory lock keyed on thread_id, or a short-TTL lock in Redis, acquired before invoke(None, config) and released at completion. If you cannot acquire the lease, someone else owns the resume - back off. The idempotency keys on side effects are your second line of defense for when the lease is missed, which is why you need both.
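The acquire-or-back-off shape looks roughly like the following. This is a single-process sketch with hypothetical names (InMemoryLeases, resume_with_lease, and the owner labels). A real deployment needs a lease shared between processes, for example Postgres's pg_try_advisory_lock or a Redis key with a TTL, and the resume callable would wrap the graph invocation.

```python
import time


class InMemoryLeases:
    """Toy lease table; only illustrates the protocol, not cross-process safety."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._held = {}          # thread_id -> (owner, expires_at)

    def try_acquire(self, thread_id, owner, ttl_seconds):
        now = self._clock()
        holder = self._held.get(thread_id)
        if holder and holder[1] > now and holder[0] != owner:
            return False         # someone else holds a live lease
        self._held[thread_id] = (owner, now + ttl_seconds)
        return True

    def release(self, thread_id, owner):
        # Only the current owner may release, so a late release cannot
        # drop a lease that has since passed to another process.
        if self._held.get(thread_id, (None, 0))[0] == owner:
            del self._held[thread_id]


def resume_with_lease(leases, thread_id, owner, resume, ttl_seconds=30):
    """Run `resume` only if this process owns the thread; otherwise back off.
    Returns None when the lease is unavailable."""
    if not leases.try_acquire(thread_id, owner, ttl_seconds):
        return None
    try:
        return resume()
    finally:
        leases.release(thread_id, owner)


leases = InMemoryLeases()
outcomes = {}


def original_resume():
    # While the original process holds the lease, the watchdog fires too.
    outcomes["watchdog"] = resume_with_lease(
        leases, "order-8841", "watchdog", lambda: "ran"
    )
    return "ran"


outcomes["original"] = resume_with_lease(
    leases, "order-8841", "original", original_resume
)
# Once the lease is released, a later watchdog attempt goes through.
outcomes["watchdog_retry"] = resume_with_lease(
    leases, "order-8841", "watchdog", lambda: "ran"
)
```

The TTL is what makes the lease imperfect: a holder that stalls past its expiry can overlap with a new owner, which is the missed-lease case the idempotency keys are there to absorb.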

When checkpointing is the wrong tool entirely

Be honest about the boundary. LangGraph checkpointing protects against application-level failures - a bad branch, a tool error, a human-in-the-loop pause - and it does it well. It does not give you infrastructure-level durable execution: there is no failure detection, no automatic resumption, no concurrency coordination, and replay re-runs side effects. If your requirement is "the network must survive process crashes and partitions and resume itself with exactly-once side effects automatically," that is a durable-execution engine's job - Temporal, DBOS, or Restate - with LangGraph running the agent logic on top. The mistake is asking the checkpointer to be a durable-execution runtime. It is a redo log, in ARIES terms; it has no undo or compensation half, and it does not detect that the writer died.

Decision guide: architecting state for an agent network

Work through these before you ship a long-running network.

Classify every channel as settled or perishable. Settled = the network produced it and it stays true. Perishable = it observes an external world. If you cannot classify a channel, it is perishable - default to distrust.
Put a Reconciliation Barrier at every entry. Re-observe perishable state and migrate schema before any agent acts. Do not branch on "is this a resume" - re-observe unconditionally.
Make every external side effect idempotent and verified. Use an idempotency key and check the system of record before acting. Never decide "did I already do this" from a boolean in your own state.
Share settled decisions, isolate perishable observations. Promote conclusions to shared state; keep working observations in subgraph-private channels with their own checkpoint_ns. Re-observe per agent, per moment.
Stamp schema_version into state and own the migration. LangGraph ships no documented schema-migration tooling as of June 2026. A renamed settled field is a silent corruption waiting for a resume.
Run langgraph-checkpoint 4.x+, keep pickle_fallback=False, set LANGGRAPH_STRICT_MSGPACK=true. Treat the checkpoint store as a code-execution trust boundary. Keep un-serializable objects out of state.
Reference large payloads, do not store them. Object storage holds the artifact; state holds the pointer. Cap messages growth.
Lease the thread before resuming. Advisory lock on thread_id so two processes never resume the same thread concurrently.
Use the Store for cross-task facts, the checkpointer for within-task state. Do not lose long-lived facts to a new thread_id.
If you need crash-survival and exactly-once automatically, reach for durable execution. LangGraph on top of Temporal/DBOS/Restate, not LangGraph pretending to be one.

The photograph is not the live feed. Persisting state was never the hard part - persisting your network's confidence in a snapshot, and then acting on that confidence after the world moved, is what ships the double refund. Classify every channel by perishability, gate the perishable ones behind a Reconciliation Barrier, and treat resume safety as a per-channel property rather than a backend you bought. Architect the resume, not just the checkpoint.

References

Chandy, K.M. & Lamport, L. (1985). Distributed Snapshots: Determining Global States of Distributed Systems. ACM Transactions on Computer Systems 3(1). https://lamport.azurewebsites.net/pubs/chandy.pdf
Mohan, C., Haderle, D., Lindsay, B., Pirahesh, H., Schwarz, P. (1992). ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging. ACM TODS 17(1). https://dl.acm.org/doi/10.1145/128765.128770
Apache Flink. An Overview of End-to-End Exactly-Once Processing in Apache Flink with Apache Kafka. https://flink.apache.org/2018/02/28/an-overview-of-end-to-end-exactly-once-processing-in-apache-flink-with-apache-kafka-too/
LangChain. LangGraph 1.0 is Now Generally Available (2025-10-22). https://changelog.langchain.com/announcements/langgraph-1-0-is-now-generally-available
LangChain Docs. Persistence. https://docs.langchain.com/oss/python/langgraph/persistence
LangChain Docs. Durable Execution. https://docs.langchain.com/oss/python/langgraph/durable-execution
LangChain Reference. Durability modes. https://reference.langchain.com/python/langgraph/types/Durability
DeepWiki. LangGraph Checkpointing Architecture. https://deepwiki.com/langchain-ai/langgraph/4.1-checkpointing
GitHub Advisory. CVE-2025-64439 - JsonPlusSerializer json-mode RCE (GHSA-wwqv-p2pp-99h5) (2025-11-05). https://github.com/advisories/GHSA-wwqv-p2pp-99h5
CVE-2026-27794 - BaseCache pickle-fallback RCE (2026-02-26). https://cvereports.com/reports/CVE-2026-27794
Diagrid. Checkpoints Are Not Durable Execution. https://www.diagrid.io/blog/checkpoints-are-not-durable-execution-why-langgraph-crewai-google-adk-and-others-fall-short-for-production-agent-workflows
Pan, T. (2026-04-10). The Stale World Model Problem in Long-Running Agents. https://tianpan.co/blog/2026-04-10-stale-world-model-long-running-agents
OpenAI Agents Python. Agent State Checkpointing and Resumption (PATCH / OVERWRITE / FILL_GAPS resume modes), Issue #2172. https://github.com/openai/openai-agents-python/issues/2172
Microsoft Azure Architecture Center. Event Sourcing pattern. https://learn.microsoft.com/en-us/azure/architecture/patterns/event-sourcing
PyPI. langgraph-checkpoint. https://pypi.org/project/langgraph-checkpoint/
