← Back to Blog
For: AI Engineers, ML Engineers, Platform Engineers, AI Systems Architects

The Snapshot Tax: Why AG-UI's STATE_DELTA Drifts in Production

The protocol promises automatic state sync. It quietly hands you the hardest distributed-systems problems and hides the failures.

#ag-ui#state-synchronization#json-patch#human-in-the-loop#agentic-ui

A user on a train approves an email your agent is about to send. The approval card shows one recipient. The agent sends to a different one. No error fired. The UI just drifted.

This is the failure mode nobody warns you about when you wire up AG-UI's shared state. There is no exception, no red banner, no failed request in the network tab. Nobody wrote a line of buggy application code. The frontend's copy of the agent's state and the agent's actual state diverged somewhere around the tunnel, the divergence never healed, and the human in your human-in-the-loop flow approved a thing that was no longer true.

The AG-UI documentation describes state synchronization as a core feature that keeps agent and frontend "perfectly synchronized." That phrasing is doing a lot of work. The protocol gives you two events - STATE_SNAPSHOT and STATE_DELTA - and a JSON Patch wire format, and then leaves you holding every hard problem that real-time distributed state has ever produced: out-of-order delivery, dropped messages on reconnect, patches applied against a base that has moved, and two writers racing on the same field. None of that is in the protocol. All of it is in your production incident channel.

The thesis

AG-UI's STATE_DELTA is marketed as automatic, bandwidth-efficient state synchronization. It is neither automatic nor reliably efficient. It is an ordered stream of JSON Patch operations with no built-in sequencing, no gap detection, and a reference implementation that swallows failed patches and never resynchronizes. The dominant failure in production is not a crash. It is silent drift - a UI that disagrees with the agent and stays that way until something downstream goes wrong.

And the bandwidth efficiency that justified choosing deltas over snapshots in the first place? On any connection that drops - which is to say, every mobile connection - you pay it back as full-snapshot recovery, often enough that the cumulative cost exceeds what whole-state updates would have cost you. I call that the snapshot tax, and almost nobody is measuring it, because the protocol's own code hides the moments where the tax comes due.

Why this matters more than it looks

If shared state were only driving a progress spinner, drift would be cosmetic. It is not. In AG-UI, shared state is the substrate for human-in-the-loop collaboration. The documented pattern is that the agent writes a proposed action into state - a recipient, a dollar amount, a destination - the frontend renders it, and the human approves, rejects, or edits before execution. The approval is meaningful only if what the human sees equals what the agent will act on.

State drift breaks exactly that invariant. The human approves the rendered state; the agent executes its own state; if those two have diverged, the approval is meaningless and you have built an oversight mechanism that provides the feeling of oversight without the substance. For an internal dashboard that is embarrassing. For an agent that moves money, sends communications, or modifies records, it is the kind of failure that ends up in a postmortem with legal cc'd.

The reason this slips past review is that it works flawlessly in development. On localhost there are no dropped packets, no reconnects, no latency-induced reordering, and your state object is small. Every failure mode in this article is invisible until you ship to users on real networks with real-sized state. That is the worst possible place to discover them.

The wrong way: the reference handler

Here is the delta-application logic straight from the AG-UI state documentation, the pattern most integrations copy verbatim. It uses fast-json-patch to apply incoming operations to the current state.

code
// The pattern the spec ships - and the one that drifts silentlyimport { applyPatch } from "fast-json-patch";function handleStateDelta(state: AppState, event: StateDeltaEvent): AppState {  const { delta } = event;  try {    // Apply JSON Patch ops to current state without mutating the original    const result = applyPatch(state, delta, /* validate */ true, /* mutate */ false);    return result.newDocument;  } catch (error) {    console.warn(      `Failed to apply state patch:\n` +        `Current state: ${JSON.stringify(state, null, 2)}\n` +        `Patch operations: ${JSON.stringify(delta, null, 2)}\n` +        `Error: ${error}`    );    // emitNoUpdate(): keep the old state, drop the delta, carry on    return state;  }}

Read the catch block again, because that is the whole problem. When a patch fails to apply - because it targets a path that does not exist yet, or replaces a value whose base has moved - the handler logs a warning to a console no user will ever see and then returns the old state unchanged. The protocol documentation describes this as handling errors "gracefully." It is not graceful. It is a silent drop.

From this moment, the frontend's state is missing an update the agent believes was delivered. Every subsequent delta is computed by the agent against a base that includes the dropped change. So the next patch is more likely to fail too - it might target a key the dropped delta was supposed to create. The failures compound. The two states drift further apart with each event, and nothing in this code path ever notices, let alone recovers.

The spec's prose does mention recovery: it says that "if inconsistencies are detected after applying patches, the frontend can request a fresh STATE_SNAPSHOT." Read that as the load-bearing sentence it is. Detecting inconsistencies is your job. Requesting the snapshot is your job. The protocol named the escape hatch and then declined to build it. The reference handler above detects nothing and requests nothing.

There is a second, quieter defect here: ordering. STATE_DELTA events are a sequence, and JSON Patch is order-dependent - an add followed by a replace on the same path is not the same as the reverse. AG-UI streams these over SSE or WebSockets, and the handler trusts that what arrives is contiguous and in order. Across a reconnect, that assumption is simply false. SSE will happily resume the stream having skipped the events that were in flight when the socket died, and this code will apply the next delta it sees against a stale base as if nothing happened.

The right way: sequence, verify, and budget the resync

A correct handler treats the delta stream the way you would treat any unreliable ordered log: it numbers the messages, refuses to apply a patch it cannot prove is the next one in line, verifies the precondition before mutating, and falls back to a snapshot when it cannot. Three mechanisms, each closing one of the gaps above.

First, the backend stamps every state event with a monotonic sequence number. AG-UI lets you attach arbitrary fields to events, so this rides along without leaving the protocol.

code
# Backend (LangGraph + CopilotKit): emit STATE_DELTA events stamped with a# monotonic sequence so the frontend can detect gaps before applying.from copilotkit import CopilotKitStatefrom ag_ui.core import StateDeltaEvent, EventTypeclass StateEmitter:    def __init__(self) -> None:        self._seq = 0    def next_seq(self) -> int:        self._seq += 1        return self._seq    def build_delta_event(self, base_seq: int, operations: list[dict]) -> StateDeltaEvent:        # base_seq = the sequence this patch was computed against.        # seq + base_seq ride as sibling fields on the event, not inside the        # state blob, so the frontend reads them before touching state.        return StateDeltaEvent(            type=EventType.STATE_DELTA,            delta=operations,            # AG-UI events accept additional fields; the middleware passes them through.            seq=self.next_seq(),            base_seq=base_seq,        )

Second, the frontend refuses to apply out-of-order or stale-base patches, and uses a JSON Patch test operation as a cheap precondition check before committing the real ops. The test op is part of RFC 6902 and is listed in AG-UI's own operation enum - it exists precisely so a patch can assert "the document still looks like what I computed against" and fail loudly if it does not. The reference handler never uses it. This one does.

code
import { applyPatch, type Operation } from "fast-json-patch";interface SequencedDelta {  seq: number;  base_seq: number;  delta: Operation[];}type ApplyOutcome =  | { kind: "applied"; state: AppState; seq: number }  | { kind: "resync" }; // caller must request a STATE_SNAPSHOTfunction applySequencedDelta(  state: AppState,  lastSeq: number,  event: SequencedDelta,  // optional: a test op the backend computed to guard the patch  precondition?: Operation[]): ApplyOutcome {  // Gap detection: this patch must build directly on what we last applied.  if (event.base_seq !== lastSeq) {    return { kind: "resync" };  }  // Precondition: verify the base looks as the backend expected.  // A failing `test` op throws here, before any mutation is committed.  if (precondition && precondition.length > 0) {    try {      applyPatch(state, precondition, true, false);    } catch {      return { kind: "resync" };    }  }  try {    const result = applyPatch(state, event.delta, true, false);    return { kind: "applied", state: result.newDocument, seq: event.seq };  } catch {    // Do NOT keep the old state and move on. A failed patch means drift.    return { kind: "resync" };  }}

Third, the caller honors a resync outcome by requesting a fresh STATE_SNAPSHOT and replacing local state wholesale - the one operation the spec is unambiguous about. The difference from the reference handler is total: instead of silently absorbing failure and drifting, this path detects the problem and heals it.

code
function onStateDelta(event: SequencedDelta): void {  const outcome = applySequencedDelta(store.state, store.lastSeq, event);  if (outcome.kind === "resync") {    metrics.increment("agui.state.resync"); // the snapshot tax, made visible    requestStateSnapshot(); // server replies with STATE_SNAPSHOT; we replace    return;  }  store.state = outcome.state;  store.lastSeq = outcome.seq;}

Note the metrics.increment line. That is not decoration. It is the only way you will ever know what the next section is about.

The snapshot tax

You chose STATE_DELTA over STATE_SNAPSHOT for one reason: bandwidth. Sending the three fields that changed beats re-sending a 40 KB document on every tick. That logic is sound on a clean connection.

The snapshot tax is what the delta strategy actually costs once you add the recovery path that correctness requires. Every gap, every failed precondition, every reconnect forces a full-snapshot resync. Each resync sends the entire state object - the very payload deltas existed to avoid. On a stable desktop connection the tax is near zero and deltas win comfortably. On a mobile connection that renegotiates every time the user walks behind a building, resyncs fire often, and the math inverts: the bytes you spend re-snapshotting can exceed the bytes you saved by sending deltas at all.

The trap is that the tax is invisible by default. The reference handler never resyncs, so it never appears to pay anything - it just drifts for free. A correct handler does pay, and unless you instrument it, you will not see the cost until someone asks why mobile data usage tripled. The agui.state.resync counter above, divided by total deltas, is your tax rate. Watch it per connection type. If it climbs on mobile, the protocol is telling you that for that segment, deltas are a worse deal than the snapshots you were trying to avoid - and the correct response may be to send snapshots on a timer and skip deltas entirely for those clients.

This is the reframe the ecosystem is missing. Deltas versus snapshots is not "efficient versus wasteful." It is a bet on connection stability, and the snapshot tax is the price of being wrong. You should know your tax rate before you ship, not after.

Three sync failures, plus an attack surface

The drift story has three distinct mechanisms, and the fixes above map onto them directly.

Patch against a stale base. The agent computes a replace at /order/total assuming the prior value. A delta that should have set /order/items never landed on the frontend, so the totals object is out of date. The replace succeeds mechanically - JSON Patch does not know the base is wrong - and now the displayed total is internally inconsistent. The test-op precondition is what turns this from a silent corruption into a detected resync.

The reconnect gap. SSE and WebSocket connections drop. When they resume, the deltas emitted during the outage are gone - SSE has a replay mechanism via Last-Event-ID, but only if both your server and client implement it, and AG-UI does not mandate it. Without sequence numbers, the client cannot even tell that it missed anything; it applies the next delta against a base that skipped three updates. Gap detection via base_seq is the only thing standing between a reconnect and permanent drift.

Concurrent writers. AG-UI shared state is bidirectional by design - both the agent and the user can write. When the user edits a form field while the agent is mid-stream writing to the same path, you have two patches racing against the same base. JSON Patch has no merge semantics and no conflict resolution; last-write-wins, and the loser's change vanishes. The protocol's own best-practices list quietly tells you to "handle state conflicts" and "implement strategies for resolving conflicting updates," which is documentation-speak for "we did not solve this; you must." For anything collaborative, you need to decide explicitly who wins per field, or move to a CRDT-backed store and treat AG-UI deltas as transport rather than truth.

There is a fourth issue that is not a sync failure but lives in the same subsystem, and it is the one most likely to get you breached. Frontend state is sent to the backend in RunAgentInput.state, and the standard integration pattern serializes that state into a system message so the LLM can reason over it. That means your shared state is untrusted user input on a direct path into the model's prompt. A field a user controls - a document title, a form value - becomes prompt content. If you string-interpolate it raw, you have built a prompt-injection channel that does not even require the user to send a chat message; they just type into a synced field. The mitigation is the same discipline you would apply to any untrusted input reaching a model: structured injection rather than raw interpolation, escaping of control sequences, and hard size limits. The size limit does double duty - it also caps the token cost of serializing large state into every turn, which is its own quiet budget leak.

Decision guide

mermaid
flowchart TD
    A["State update to send"] --> B{"Connection stable?<br/>(measure resync rate)"}
    B -->|"Low resync rate"| C["Use STATE_DELTA<br/>with sequence numbers"]
    B -->|"High resync rate<br/>(mobile, flaky)"| D["Send STATE_SNAPSHOT<br/>on a timer; skip deltas"]
    C --> E{"base_seq == lastSeq?"}
    E -->|"No - gap"| F["Request STATE_SNAPSHOT<br/>increment tax counter"]
    E -->|"Yes"| G{"test-op precondition holds?"}
    G -->|"No"| F
    G -->|"Yes"| H["Apply delta<br/>advance lastSeq"]
    F --> I["Replace local state<br/>wholesale from snapshot"]
    I --> J["Resume applying deltas<br/>from new baseline seq"]
    J --> E

    style A fill:#4A90E2,stroke:#2c5aa0,color:#fff
    style B fill:#7B68EE,stroke:#5544bb,color:#fff
    style C fill:#6BCF7F,stroke:#3a9a4e,color:#fff
    style D fill:#FFD93D,stroke:#c9a800,color:#fff
    style E fill:#7B68EE,stroke:#5544bb,color:#fff
    style F fill:#E74C3C,stroke:#a82315,color:#fff
    style G fill:#7B68EE,stroke:#5544bb,color:#fff
    style H fill:#6BCF7F,stroke:#3a9a4e,color:#fff
    style I fill:#98D8C8,stroke:#5fa697,color:#fff
    style J fill:#98D8C8,stroke:#5fa697,color:#fff

Before you ship AG-UI shared state, confirm:

  • Sequence every state event. Stamp seq and base_seq on every STATE_DELTA event; the frontend rejects any delta whose base does not match the last applied sequence.
  • Never silently drop a failed patch. A patch that does not apply means drift. Resync, do not emitNoUpdate().
  • Use test-op preconditions on deltas whose correctness depends on the base value, so a moved base fails loudly instead of corrupting quietly.
  • Instrument the snapshot tax. Track resync / total_deltas per connection type. If mobile climbs, switch that segment to timed snapshots.
  • Decide concurrent-write semantics explicitly. Per-field ownership or a CRDT. Do not leave it to last-write-wins by accident.
  • Treat synced state as untrusted prompt input. Structured injection, escaping, and size caps - the size cap also bounds token cost.
  • Test on a real degraded network, not localhost. Throttle, drop the connection mid-stream, and confirm the UI heals rather than drifts.

The protocol is genuinely useful - standardizing the agent-to-user layer is overdue, and AG-UI's event model is a reasonable foundation. But "synchronized" is a claim the protocol makes and your code has to keep. The snapshot tax is always being levied; the only choice you have is whether you measure it or let it accumulate silently until a user on a train approves the wrong thing.

References


Agentic AI

Follow for more technical deep dives on AI/ML systems, production engineering, and building real-world applications:


Comments