Epistemic Restraint by Design: Boundary-Aware AI Systems

This article is the practitioner companion to my position paper, Epistemic Restraint by Design: A Position Paper on Boundary-Aware AI Systems (PDF). The paper makes the formal case; this piece turns it into code you can run. Every claim here traces back to it.

A deployed assistant tells a user, with no hedge, that a drug interaction is safe when it is not. A multi-agent pipeline summarizes a contract clause that does not exist. A coding agent cites an API method that was never shipped. None of these is exotic. They are the expected behavior of a system optimized to produce fluent continuations under uncertainty, wired into a pipeline that has no representation of what it is allowed to assert.

The field calls this hallucination, and the default engineering reflex is to treat it as something the model got wrong: reach for a bigger model, more retrieval, another round of preference tuning, a stricter prompt. Those interventions help. They also miss where a large share of production damage actually originates.

The thesis: hallucination has a systems-level failure mode

Here is the claim this article owns: at platform scale, hallucination is not only a model problem - it is an architecture problem, and the layer most teams are missing is an explicit, auditable support boundary. A larger model does not tell your pipeline which sources are authoritative, how fresh they must be, what claims are in scope, or what to do when support is absent. Without those system-level controls, a more accurate model still presents weakly supported content in the same confident register as a grounded fact. The practical gap is not only predictive error. It is the absence of an explicit policy for asserting, verifying, or deferring.

This challenges a comfortable consensus - that hallucination is a model-quality dial you turn down with scale and better training. In September 2025, OpenAI researchers argued that hallucinations are a mathematically expected consequence of how language models are trained and evaluated: benchmarks that reward guessing over abstention make confident wrong answers the rational output (Kalai et al., 2025). If confident guessing is baked into the objective, then the deployment around the model - not the next checkpoint - is where you install the discipline to not answer.

I call the framework for that discipline Epistemic Restraint by Design (ERD). It is not a model, a product, or a new decoding trick. It is a way of building the system around the model so that it can state what supports an answer, preserve uncertainty through a workflow, and know when to verify or stop. This is a framework and position piece; it consolidates and operationalizes ideas that a recent cluster of research has been circling, and it does not claim priority for abstention, gating, provenance, or verification.

Why this matters at scale

Model-level error-rate reductions are quoted as small percentages, and at low volume a small percentage is a tolerable number of bad outputs. At platform scale it is not. A system serving on the order of hundreds of millions of users, at even one request per user per day, turns a residual fraction-of-a-percent hallucination rate into hundreds of thousands of ungrounded outputs per day. Pushing that residual rate down through model improvements alone runs into diminishing returns - each increment costs more and buys less.

ERD changes the lever. Instead of trying to make every answer correct, it makes the system decline the answers it cannot ground. In health, finance, legal, and safety settings - where the cost of a confident wrong answer is high - a system that defers cleanly on the hard fraction is worth more than one that answers everything slightly more accurately. The real trade is between coverage and trust, and most pipelines make it implicitly and by accident. ERD makes it explicit and tunable.

The wrong way: confidence masking

Here is the pattern in nearly every retrieval-augmented pipeline shipped under deadline. Retrieve some context, staple it to the prompt, return whatever the model says.

code

def answer(query: str) -> str:
    docs = retriever.search(query, k=5)
    context = "\n\n".join(d.text for d in docs)
    # The model always produces a fluent answer, whether or not the
    # retrieved context actually supports one. Retrieval that returns
    # five irrelevant chunks looks identical to retrieval that nails it.
    return llm.generate(
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

The bug is not in any single line. It is that the function has exactly one exit: a fluent answer. A query about an entity absent from the corpus, an event after the sources' freshness window, or a domain the system was never authorized to speak on all flow through the same path and come out sounding equally authoritative.

This is the consensus trap worth naming directly: Retrieval-Augmented Generation (RAG) is widely treated as the hallucination fix, but it improves the accuracy of in-scope answers while doing almost nothing to stop out-of-scope ones. Retrieval that returns five irrelevant chunks and retrieval that nails the answer produce the same shape of output, because there is no gate that can say "not this one." RAG raises the ceiling on questions you should answer; it does not install a floor under the ones you should not.

I call this failure confidence masking: outputs presented in a single confident register regardless of the evidence available to support them. The model's fluency is doing double duty - it is both the answer and, implicitly, a claim of warrant the system never actually checked. Worse, in multi-step pipelines that confidence launders: a weakly grounded intermediate result gets consumed by a downstream step as if it were established fact, and by the final answer nobody can tell which claims were supported and which were invented three hops back.

The right way: encode the boundary before you generate

The fix starts by separating two things the naive pipeline conflates: the model's parametric knowledge boundary (what the weights happen to know) and the system's support boundary (what this deployment is permitted to assert from its configured sources). ERD treats the support boundary as a property of the deployment, not the model.

Formally, let a boundary decision be a policy-aware predicate over a request q, a time-stamped registry of sources and tools S_t, and a policy P that specifies domain scope, source authority, freshness requirements, allowed claim types, and permitted fallback actions:

code

B(q; S_t, P) -> { IB, OOB, U }

IB is in-boundary, OOB is out-of-boundary, and U is unresolved under the available metadata. A request is in-boundary only when there exists an evidence subset that satisfies every required condition - coverage, authority, freshness, provenance traceability, and policy allowance. The single most important rule: a conservative gate must never treat U as permission to answer.
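One way to write the in-boundary condition out is sketched below. It is an illustration rather than the position paper's formal definition: the predicate names L, C, A, F, and T (policy allowance, coverage, authority, freshness, provenance traceability) follow the conditions listed above, and their argument lists are assumptions.

```latex
B(q; S_t, P) = \mathrm{IB}
\iff
L(q, P) \;\wedge\; \exists\, E \subseteq S_t :\;
C(E, q, P) \wedge A(E, P) \wedge F(E, t, P) \wedge T(E, P)
```

Under this reading, U covers the case where the available metadata cannot settle the conjunction either way, and a conservative gate handles it like OOB for the purpose of answering.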

The first ERD pattern, the Knowledge Gating Layer, implements exactly this predicate and sits in front of generation.

code

from dataclasses import dataclass
from enum import Enum


class Boundary(Enum):
    IB = "in_boundary"
    OOB = "out_of_boundary"
    U = "unresolved"


@dataclass
class Policy:
    allowed_domains: set[str]
    min_coverage: float      # required evidence-query coverage
    min_authority: float     # required source authority
    max_age_days: float      # freshness window


@dataclass
class Evidence:
    text: str
    coverage: float          # how well this evidence covers q, in [0, 1]
    authority: float         # source authority, in [0, 1]
    age_days: float
    provenance_ok: bool      # traceable to an approved source


def gate(query_domain: str,
         evidence_set: list[Evidence],
         policy: Policy) -> tuple[Boundary, list[Evidence], str]:
    # L: policy allowance is checked first and decides OOB outright.
    if query_domain not in policy.allowed_domains:
        return Boundary.OOB, [], "domain outside policy scope"

    # C, A, F, T: find a subset that satisfies all support conditions.
    supported = [
        e for e in evidence_set
        if e.coverage >= policy.min_coverage
        and e.authority >= policy.min_authority
        and e.age_days <= policy.max_age_days
        and e.provenance_ok
    ]
    if supported:
        return Boundary.IB, supported, "coverage+authority+freshness+provenance met"

    # Evidence exists but none clears the policy -> known OOB.
    if evidence_set:
        return Boundary.OOB, [], "no evidence subset meets support policy"

    # Nothing retrieved: the gate cannot resolve support. Never a pass.
    return Boundary.U, [], "insufficient metadata to resolve support"

The gate records not just its decision but the condition that produced it - the machine-readable reason trace ρ. That trace is what turns a silent failure into an audit line, the same discipline as logging context and not just actions in agent pipelines. When the drug-interaction query comes in and the retrieved evidence includes no authoritative, in-freshness pharmacology source, the gate returns OOB with a reason (or U, if nothing was retrieved at all) before a single token of a fluent "it's safe" is generated.
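A reason trace is only useful if it lands somewhere queryable. The sketch below shows one way to turn a gate decision into a single JSON log line. The function name `audit_line` and every field name are hypothetical choices for illustration, not a schema the framework prescribes.

```python
import json
from datetime import datetime, timezone


def audit_line(request_id: str, decision: str, reason: str,
               evidence_ids: list[str]) -> str:
    """Serialize one gate decision as a single JSON log line (hypothetical schema)."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "request_id": request_id,
        "decision": decision,          # "in_boundary" | "out_of_boundary" | "unresolved"
        "reason": reason,              # the condition that produced the decision
        "evidence_ids": evidence_ids,  # empty unless the decision is in-boundary
    }
    return json.dumps(record, sort_keys=True)


line = audit_line(
    "req-001",
    "out_of_boundary",
    "no evidence subset meets support policy",
    [],
)
print(line)
```

Because the reason is a stable string rather than free text, deferrals can be grouped by cause when reviewing why the gate refused.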

Two of those conditions - coverage and provenance_ok - are the hard ones, and worth being honest about. Coverage is really the question "does this evidence actually support the claim," which is itself an attribution or faithfulness judgment; in practice you estimate it with the same tooling you would use to grade a RAG answer - atomic-fact scoring like FActScore, retrieval-faithfulness scoring like RAGAS, or an attribution-to-source check. The uncomfortable part: that estimator can be wrong, so the gate inherits its error - a mis-scored coverage can wave through an unsupported claim or bounce a supported one. ERD does not make that problem disappear. It moves it out of the model's head and into one inspectable component with a logged reason, where you can measure it and tune it.
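Measuring that inherited error can start small: score a hand-labeled sample with the coverage estimator and count how often a given threshold waves through unsupported evidence or bounces supported evidence. The sketch below assumes you already have (estimated coverage, human label) pairs; `gate_error_rates` and the sample values are hypothetical.

```python
from typing import Optional


def gate_error_rates(scored: list[tuple[float, bool]],
                     min_coverage: float) -> dict[str, Optional[float]]:
    """scored holds (estimated_coverage, human_says_supported) pairs."""
    # Unsupported evidence that the threshold would let through.
    false_pass = sum(1 for s, ok in scored if s >= min_coverage and not ok)
    # Supported evidence that the threshold would reject.
    false_block = sum(1 for s, ok in scored if s < min_coverage and ok)
    unsupported = sum(1 for _, ok in scored if not ok)
    supported = sum(1 for _, ok in scored if ok)
    return {
        "false_pass_rate": false_pass / unsupported if unsupported else None,
        "false_block_rate": false_block / supported if supported else None,
    }


# Made-up labeled sample, for illustration only.
sample = [(0.92, True), (0.81, True), (0.77, False),
          (0.40, False), (0.55, True), (0.30, False)]

for threshold in (0.5, 0.8):
    print(threshold, gate_error_rates(sample, threshold))
```

On this toy sample, raising the threshold from 0.5 to 0.8 removes the false pass and introduces a false block, which is the trade a tuned `min_coverage` has to settle.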

This is the whole move: the naive pipeline had one exit; ERD gives it three, and forces the system to pick one on the record.

Epistemic Restraint by Design: three principles, three patterns

ERD is three principles, each realized by one composable pattern. The principles say what a boundary-aware system must do; the patterns say how a pipeline does it. Every pattern consumes and emits explicit boundary or epistemic state, so each can be tested in isolation and audited in composition.

Principle 1 - Boundary Encoding. The system holds an explicit, inspectable representation of the evidence conditions under which it may assert a claim: source coverage, authority, freshness, allowed uses, and no-go zones as policy state, not as an implicit hope about model behavior. Realized by the Knowledge Gating Layer above.

Principle 2 - Propagation Awareness. Multi-step systems preserve the epistemic status of intermediate outputs so that a weakly grounded early result cannot silently become a trusted late one. Realized by the Uncertainty Propagation Graph.

Principle 3 - Graceful Deferral. At or near the boundary, the system abstains, clarifies, retrieves, verifies, or escalates rather than guessing fluently - and treats a good deferral as a successful outcome, not a failure. Realized by Verification Routing.

The one-line version, which is worth memorizing: encode the boundary, carry the uncertainty, route on it.

Carrying uncertainty across steps: the Uncertainty Propagation Graph

Missing Propagation Awareness is where multi-agent systems quietly rot, and it is how a single weak step typically turns into a pipeline-wide failure. A summarizer node produces a plausible-but-thin claim; a planner node treats it as fact; an executor acts on it. The Uncertainty Propagation Graph (UPG) attaches an epistemic-state record to every node's output and carries it down every edge.

A note on naming, since the vocabulary here is crowded: recent work such as UProp (Xu et al., 2025) studies uncertainty propagation as an information-theoretic estimation problem - decomposing and estimating how much uncertainty a step inherits. The UPG is deliberately not that. It is a conservative eligibility bound over a directed acyclic graph (DAG), not a calibrated probability of truth. The two are complementary: use an estimator like UProp to populate the confidence field; use the UPG to enforce a weak-link policy over the whole record.

code

@dataclass
class EpistemicState:
    boundary: Boundary
    coverage: float      # c: evidence-query coverage
    authority: float     # a: source authority
    freshness: float     # f: freshness score
    provenance: float    # p: provenance integrity
    confidence: float    # m: model-side signal - ONE input, not the verdict


def local_eligibility(z: EpistemicState) -> float:
    # Weak-link: the least eligible field caps the node. A confident model
    # (high m) cannot rescue a node whose evidence coverage (c) is near zero.
    return min(z.coverage, z.authority, z.freshness, z.provenance, z.confidence)


def node_eligibility(z: EpistemicState, predecessor_scores: list[float]) -> float:
    # A downstream result is never more eligible for unverified return
    # than its least eligible dependency.
    r_local = local_eligibility(z)
    if predecessor_scores:
        return min(r_local, min(predecessor_scores))
    return r_local

The min operator is the entire point. It encodes a weak-link policy: a downstream node cannot launder itself into higher eligibility than its worst input, and model confidence m is only one of five fields, alongside four evidence- and policy-facing ones. This is the structural cure for confidence laundering.

It also has a cost you should see coming: min is conservative, and that conservatism compounds with depth. On a long chain a single weak node caps everything downstream, so the chance that a deep result clears your return threshold falls off fast - a long agent pipeline will over-defer unless you intervene. Two levers. Insert a verification node that re-grounds the weak step against an authoritative source - that is the only thing that should be allowed to raise eligibility again, and only with a logged source. Or, when the chain is long and per-node signals are genuinely independent and calibrated, swap min for a declared soft aggregator that trades a little conservatism for coverage. Keep min when stakes are high, chains are short, or failures are correlated - a shared bad source is exactly what defeats averaging. Whatever you pick, declare it and stress-test it, because the moment aggregation quietly assumes independence it stops being conservative.
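The depth cost and the soft-aggregator trade can both be put in numbers. The sketch below is illustrative: the geometric mean stands in as one possible declared soft aggregator (an assumption, not the framework's prescribed choice), and `chain_pass_probability` assumes every node clears the threshold independently with the same probability.

```python
import math


def weakest_link(scores: list[float]) -> float:
    # Conservative default: the chain is only as eligible as its worst node.
    return min(scores)


def geometric_mean(scores: list[float]) -> float:
    # A softer aggregator. Only defensible when per-node scores are
    # independent and calibrated; it always sits at or above the minimum.
    return math.prod(scores) ** (1.0 / len(scores))


def chain_pass_probability(p_node: float, depth: int) -> float:
    # Under min, the chain clears the threshold only if every node does.
    # With independent nodes that each pass with probability p_node,
    # the whole chain passes with probability p_node ** depth.
    return p_node ** depth


chain = [0.9, 0.9, 0.4, 0.9]
print(weakest_link(chain))
print(round(geometric_mean(chain), 3))
print(round(chain_pass_probability(0.95, 10), 3))
```

For this chain the weak node caps min at 0.4 while the geometric mean lands near 0.73, and a 95% per-node pass rate leaves roughly a 60% chance that a ten-step chain clears the threshold. If the nodes share a bad source, the independence assumption fails and the softer number overstates eligibility.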

Routing on uncertainty instead of masking it

Graceful Deferral turns the eligibility score into an action. Verification Routing takes an output and its epistemic state and chooses the next permitted step - it is a policy, not a vague instruction to "be careful."

code

class Action(Enum):
    RETURN = "return"
    VERIFY = "verify"
    CLARIFY = "clarify"
    ESCALATE = "escalate"
    DEFER = "defer"


def route(z: EpistemicState,
          r: float,
          tau_return: float,
          can_verify: bool,
          can_clarify: bool,
          can_escalate: bool) -> Action:
    # Return only when in-boundary AND eligibility clears the threshold.
    if z.boundary is Boundary.IB and r >= tau_return:
        return Action.RETURN

    # Otherwise prefer the strongest available recovery path, in policy order.
    if can_verify:
        return Action.VERIFY
    if can_clarify:
        return Action.CLARIFY
    if can_escalate:
        return Action.ESCALATE

    # No recovery path exists: degrade to an explanatory deferral.
    return Action.DEFER

The precedence here is illustrative - a clinical tool might put escalate above verify; a search assistant might collapse everything to clarify or defer. What matters is that the routing decision is a function of explicit state, logged, and tunable. Where no fallback exists at all, routing degenerates to an honest "I can't support an answer to this," which is still strictly better than a fluent guess.
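Making the precedence a parameter keeps that choice visible per deployment. The sketch below is a variant of the routing function with the recovery order passed in as data. `route_with_precedence` and the two example orderings are hypothetical names, and the `Action` enum is redefined so the block runs on its own.

```python
from enum import Enum


class Action(Enum):
    RETURN = "return"
    VERIFY = "verify"
    CLARIFY = "clarify"
    ESCALATE = "escalate"
    DEFER = "defer"


def route_with_precedence(in_boundary: bool,
                          r: float,
                          tau_return: float,
                          available: set[Action],
                          precedence: tuple[Action, ...]) -> Action:
    # Return only when in-boundary AND eligibility clears the threshold.
    if in_boundary and r >= tau_return:
        return Action.RETURN
    # Otherwise take the first recovery path the policy lists that is available.
    for action in precedence:
        if action in available:
            return action
    # Nothing in the policy order is available: explanatory deferral.
    return Action.DEFER


# Illustrative orderings, not recommendations.
CLINICAL = (Action.ESCALATE, Action.VERIFY, Action.CLARIFY)
SEARCH = (Action.CLARIFY,)

available = {Action.VERIFY, Action.ESCALATE}
print(route_with_precedence(False, 0.3, 0.8, available, CLINICAL))
print(route_with_precedence(False, 0.3, 0.8, available, SEARCH))
```

With the same state and the same available tools, the clinical ordering escalates while the search ordering falls through to a deferral, so the difference between deployments lives in one logged tuple rather than in prompt wording.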

Composed, the three patterns look like this:

mermaid

flowchart LR
    Q[User query] --> KGL{Knowledge Gating Layer<br/>boundary predicate}
    KGL -->|in-boundary| GEN[Generation + UPG<br/>carry epistemic state]
    KGL -->|out-of-boundary| DEF[Defer / escalate<br/>state the limit]
    KGL -->|unresolved| DEF
    GEN -->|r ≥ τ and IB| ANS[Return answer<br/>with provenance]
    GEN -->|low eligibility| VR{Verification Routing}
    VR -->|verify / clarify| GEN
    VR -->|no path| DEF

    classDef gate fill:#FFD93D,stroke:#B8860B,color:#2C2C2A
    classDef gen fill:#4A90E2,stroke:#3A7BC8,color:#FFFFFF
    classDef good fill:#6BCF7F,stroke:#3F9950,color:#2C2C2A
    classDef defer fill:#E74C3C,stroke:#B03A2E,color:#FFFFFF
    classDef route fill:#7B68EE,stroke:#6858DE,color:#FFFFFF
    classDef ctx fill:#95A5A6,stroke:#6C7A7A,color:#FFFFFF
    class Q ctx
    class KGL gate
    class GEN gen
    class ANS good
    class VR route
    class DEF defer

Measuring restraint: the Epistemic Compliance Score

Accuracy, calibration, and the usual RAG quality metrics are necessary but they do not answer the ERD question: did the system comply with its own evidence policy? A model can score well on accuracy by answering everything and getting most of it right, while quietly committing thousands of boundary violations on the requests it should have refused. The Epistemic Compliance Score (ECS) is a cost-sensitive utility built to expose exactly that gap.

Every request is labeled against a frozen boundary specification as in-boundary or out-of-boundary, then adjudicated into one of seven mutually exclusive outcomes:

Count	Outcome	Weight
n1	IB, supported and correct	+1
n2	IB, incorrect or unfaithful	-1
n3	IB, deferred (over-restraint)	-β
n4	OOB, useful deferral (states limit, routes)	+1
n5	OOB, bare but safe deferral	+α
n6	OOB, unsupported but accidentally correct	-δ
n7	OOB, false or misleading	-γ

The two distinctions in the bottom half are what most metrics miss. A useful deferral (n4) that tells the user what is unsupported and offers a next step is worth more than a bare refusal (n5). And an out-of-boundary answer that happens to be true (n6) is still a policy violation - just a less harmful one than a false answer (n7) - so δ sits below γ. Rewarding n6 as if it were a win is how teams train systems to guess confidently and get lucky.
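Written out, the weights in the table combine into a single normalized score over all adjudicated requests:

```latex
\mathrm{ECS} = \frac{n_1 + n_4 + \alpha\, n_5 - n_2 - \beta\, n_3 - \delta\, n_6 - \gamma\, n_7}{N},
\qquad N = \sum_{i=1}^{7} n_i
```

A system that only ever produces correct in-boundary answers and useful deferrals scores exactly 1; every other outcome pulls the score down by its weight.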

code

def ecs(counts: tuple[int, ...],
        alpha: float = 0.25, beta: float = 0.5,
        delta: float = 1.25, gamma: float = 2.0) -> float:
    n1, n2, n3, n4, n5, n6, n7 = counts
    N = sum(counts)
    return (n1 + n4 + alpha * n5 - n2 - beta * n3
            - delta * n6 - gamma * n7) / N


# Ground-truth link to the paper's worked example.
# System A answers everything; System B adds a Knowledge Gating Layer.
assert round(ecs((50, 10, 0, 0, 0, 5, 35)), 2) == -0.36
assert round(ecs((48, 6, 6, 30, 5, 2, 3)), 2) == 0.62

Run those two assertions and the whole argument collapses into two numbers. System A - the answer-everything baseline - scores -0.36: it gets 50 in-boundary questions right, but it also answers all 40 out-of-boundary requests, 35 of them wrongly, and ECS punishes that. System B adds a single gate: it sacrifices a little in-boundary coverage (6 questions it now defers) and converts 35 of those 40 boundary violations into deferrals, 30 of them useful, landing at +0.62. Same corpus, same questions, similar in-boundary accuracy on what they choose to answer (50 of 60 versus 48 of 54) - and nearly a full point of separation (0.98) on the metric that measures whether the system stayed inside its support boundary.

This is the other consensus belief ERD rejects: that abstention is a cost to be minimized. Because standard benchmarks score a refusal the same as a wrong answer - both are "not correct" - teams learn to disable abstention to protect their leaderboard numbers, which is exactly the incentive OpenAI's inevitability result identifies. ECS inverts that incentive. A useful deferral scores +1, the same as a correct in-boundary answer, so a system is never punished for declining a question it genuinely cannot ground. That single design choice is what makes restraint trainable instead of penalized.
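The incentive flip is easy to check with expected values. The sketch below compares what a guess on an out-of-boundary request is worth under binary accuracy grading and under the default ECS weights. The function names are hypothetical, and the model of a guess (correct with probability `p_correct`) is a simplifying assumption.

```python
def benchmark_value_of_guess(p_correct: float) -> float:
    # Binary accuracy grading: 1 if right, 0 if wrong. A refusal also scores 0,
    # so any nonzero chance of being right beats abstaining.
    return p_correct


def ecs_value_of_guess(p_correct: float,
                       delta: float = 1.25, gamma: float = 2.0) -> float:
    # Out-of-boundary answer under ECS: an accidentally correct answer costs
    # delta (outcome n6) and a false one costs gamma (outcome n7).
    return -(p_correct * delta + (1.0 - p_correct) * gamma)


ECS_USEFUL_DEFERRAL = 1.0   # outcome n4
ECS_BARE_DEFERRAL = 0.25    # outcome n5, the default alpha

for p in (0.1, 0.5, 0.9):
    print(p, benchmark_value_of_guess(p), ecs_value_of_guess(p))
```

Under binary grading the guess always beats a refusal; under ECS the guess is negative at every probability, so even a bare deferral dominates it.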

ECS alone is not enough to run a system, and reporting it as a single number hides the failure structure. Pair it with the disaggregated rates that tell you where a system is winning or losing:

code

def _safe(num: float, den: float) -> float | None:
    # Undefined when the stratum is empty (e.g. a baseline that never defers).
    return num / den if den else None


def compliance_rates(counts: tuple[int, ...]) -> dict[str, float | None]:
    n1, n2, n3, n4, n5, n6, n7 = counts
    oob = n4 + n5 + n6 + n7
    return {
        "in_boundary_accuracy": _safe(n1, n1 + n2),        # answered IB only
        "boundary_violation_rate": _safe(n6 + n7, oob),    # OOB answered at all
        "deferral_quality": _safe(n4, n4 + n5),            # useful vs. bare refusal
    }

Boundary-violation rate is the number to put on the dashboard: it is the fraction of out-of-boundary requests the system answered when it should have deferred, and it is invisible to accuracy. Deferral quality guards the opposite failure - a system that refuses everything with a bare "I can't help" scores well on violations but badly here. Because the weights encode deployment values, every reported ECS result must also come with its (α, β, δ, γ) vector and a sensitivity sweep. ECS is a complement to your accuracy and calibration numbers, never a replacement.
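A sensitivity sweep can be a few lines. The sketch below recomputes ECS for the two worked-example systems over a small grid of weights and reports whether the ranking ever flips. The grid values are hypothetical, and `ecs` is repeated here so the block runs on its own.

```python
from itertools import product


def ecs(counts, alpha, beta, delta, gamma):
    n1, n2, n3, n4, n5, n6, n7 = counts
    return (n1 + n4 + alpha * n5 - n2 - beta * n3
            - delta * n6 - gamma * n7) / sum(counts)


SYSTEM_A = (50, 10, 0, 0, 0, 5, 35)   # answers everything
SYSTEM_B = (48, 6, 6, 30, 5, 2, 3)    # adds a gate

# Hypothetical grid; pick ranges that reflect your deployment's values.
GRID = {
    "alpha": (0.0, 0.25, 0.5),
    "beta": (0.25, 0.5, 1.0),
    "delta": (0.5, 1.25),
    "gamma": (1.0, 2.0, 4.0),
}

rows = []
for alpha, beta, delta, gamma in product(*GRID.values()):
    if delta >= gamma:
        continue  # keep delta below gamma, as the outcome table requires
    a = ecs(SYSTEM_A, alpha, beta, delta, gamma)
    b = ecs(SYSTEM_B, alpha, beta, delta, gamma)
    rows.append(((alpha, beta, delta, gamma), a, b))

flips = [weights for weights, a, b in rows if b <= a]
print(f"{len(rows)} weight vectors swept, {len(flips)} where B does not beat A")
```

If the ranking holds across the whole grid, the conclusion does not hinge on the particular weights; if it flips, the flipping vectors are the ones to publish and argue about.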

If you take one new thing from this piece, take these two handles: confidence masking as the name for the anti-pattern you are almost certainly shipping, and a boundary-compliance metric you can run in ten lines to prove whether a gate actually changed anything. Neither requires a new model.

When to reach for ERD (and when not to)

Before the decision guide, a diagnostic. Map the symptom you are actually seeing in production to the mechanism that addresses it - this is how you decide which of the three patterns to build first rather than adopting all of ERD at once.

Production symptom	ERD mechanism	Principle
System answers questions about entities not in your corpus	Knowledge Gating Layer (`OOB` before generation)	Boundary Encoding
A wrong intermediate claim shows up as fact in the final answer	Uncertainty Propagation Graph (weak-link `min`)	Propagation Awareness
Confident answers to questions past your data's freshness window	Knowledge Gating Layer (freshness condition `F`)	Boundary Encoding
Refusals are bare "I can't help" with no next step	Verification Routing (`clarify` / `escalate`)	Graceful Deferral
High accuracy on benchmarks, complaints about made-up answers	ECS + boundary-violation rate	Measurement

ERD is not free and it is not always the right tool. Use this as a decision guide.

Apply ERD when:

The system has an articulable boundary - a defined corpus, a freshness window, a domain scope, or an authority policy. If you can name the sources you trust, you can encode them.
The cost of a confident wrong answer is high: health, finance, legal, safety, or anything that triggers a compliance review.
The pipeline is multi-step or multi-agent, so intermediate results get consumed downstream and confidence can launder. This is where the UPG earns its cost.
You are hitting diminishing returns from model swaps and prompt tuning, and your remaining failures are ungrounded confident answers rather than reasoning errors.

Do not reach for ERD when:

The deployment is genuinely open-domain with no definable support boundary and low stakes - a brainstorming toy, a creative-writing aid. Gating an open domain just adds latency for a boundary you cannot specify.
The system is a single isolated call with no dependency path. The UPG adds nothing where there is nothing to propagate; a lightweight gate may still help, but skip the graph.
Latency and cost budgets cannot absorb gating, propagation tracking, and verification routing, and the cost of a wrong answer is low enough that confidence masking is an acceptable risk. Be honest about this rather than pretending the trade does not exist.

Implementation checklist:

Write the boundary specification first: source registry, authority hierarchy, freshness windows, permitted claim types, fallback routes. If you cannot write it down, you do not yet have a boundary to enforce.
Put the Knowledge Gating Layer in front of generation and make U a non-answer by default.
Attach an epistemic-state record to every node and aggregate with min until you have calibrated something better.
Make Verification Routing a logged policy function, not a prompt instruction.
Score with ECS and publish the weight vector plus a sensitivity sweep. Track boundary-violation rate as your primary safety number.
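The first checklist item can be sketched as a small, frozen data structure with a self-check. Everything here is hypothetical: the class names, the field names, and the example values are illustrations of what a written-down boundary specification might contain, not a schema from the position paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceSpec:
    source_id: str
    authority: float       # rank in the authority hierarchy, in [0, 1]
    max_age_days: float    # freshness window for this source


@dataclass(frozen=True)
class BoundarySpec:
    version: str
    allowed_domains: frozenset[str]
    permitted_claim_types: frozenset[str]
    sources: tuple[SourceSpec, ...]
    fallback_routes: tuple[str, ...]   # in policy order

    def problems(self) -> list[str]:
        """Return human-readable gaps; an empty list means nothing is missing."""
        issues = []
        if not self.allowed_domains:
            issues.append("no domain scope declared")
        if not self.permitted_claim_types:
            issues.append("no permitted claim types declared")
        if not self.sources:
            issues.append("source registry is empty")
        for s in self.sources:
            if not 0.0 <= s.authority <= 1.0:
                issues.append(f"{s.source_id}: authority outside [0, 1]")
            if s.max_age_days <= 0:
                issues.append(f"{s.source_id}: freshness window must be positive")
        if not self.fallback_routes:
            issues.append("no fallback route declared")
        return issues


# Example values are made up for illustration.
spec = BoundarySpec(
    version="example-v1",
    allowed_domains=frozenset({"pharmacology"}),
    permitted_claim_types=frozenset({"interaction_lookup"}),
    sources=(SourceSpec("formulary-db", authority=0.9, max_age_days=90),),
    fallback_routes=("verify", "escalate", "defer"),
)
print(spec.problems())
```

Freezing the dataclasses and carrying a version string makes the specification something a metric can be scored against and an auditor can diff.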

The honest limitations

ERD is not a complete solution and it does not make models omniscient. It assumes a capable, reasonably calibrated generator and makes the system around it honest - a perfectly restrained system that grounds nothing is useless. Boundary Encoding also pushes a hard problem to a new place: someone has to define the boundary, and for open domains that is partly human judgment. I consider that a feature, because it turns an implicit decision into an explicit, reviewable one - but it is real work, not a free lunch. And gating, propagation, and routing add inference overhead; the trade is only worth it where a wrong answer is expensive.

Two limits are sharp enough to name. First, the gate protects the boundary, not in-boundary correctness. It cuts the answers you give outside your support, but a confident answer that is in-boundary and clears the eligibility threshold is returned without further checking - so ERD does little for in-boundary factual errors. Those stay a model-and-grounding problem, and if you fear them you add spot-verification of high-stakes in-boundary answers on top of the gate. Second, a wrong boundary is its own failure mode. If the encoded boundary is mis-specified the system will confidently refuse answerable questions, and an uneven boundary can refuse some domains or users more than others. The frozen boundary is ground truth inside the metric; in production it is a policy artifact you have to audit like any other.
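Auditing for an uneven boundary can begin with a per-group deferral rate over requests a reviewer has labeled answerable. The sketch below assumes such a labeled log exists; `deferral_rate_by_group`, the group names, and the counts are hypothetical.

```python
from collections import Counter


def deferral_rate_by_group(log: list[tuple[str, bool]]) -> dict[str, float]:
    """log holds (group, was_deferred) pairs for requests labeled answerable."""
    total = Counter(group for group, _ in log)
    deferred = Counter(group for group, was_deferred in log if was_deferred)
    return {group: deferred[group] / total[group] for group in total}


# Made-up audit log: 20 answerable requests per group.
log = (
    [("billing", False)] * 18 + [("billing", True)] * 2
    + [("dosage", False)] * 6 + [("dosage", True)] * 14
)

rates = deferral_rate_by_group(log)
print(rates)
```

A large gap between groups on answerable requests, as in this toy log, points at the boundary specification or its source coverage rather than at the model.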

The claim is narrow and I will restate it plainly: hallucination is usually treated as something a model gets wrong, and at platform scale that framing leaves the highest-leverage fixes on the table. A deployment needs an explicit support policy, traceable epistemic state, and disciplined behavior at the edge of its permitted evidence. Epistemic Restraint by Design gives that requirement a structure - encode the boundary, carry the uncertainty, route on it - and a way to measure whether you did. The best systems will still pair it with better models, better retrieval, and better calibration. But they will stop pretending that a bigger model is the same thing as a system that knows when to stop talking.

References

Kalai, A. T., Nachum, O., Vempala, S., & Zhang, E. (2025). Why Language Models Hallucinate. arXiv:2509.04664. https://arxiv.org/abs/2509.04664
Ji, Z., Lee, N., Frieske, R., et al. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12), 1-38.
Li, M., Zhao, Y., Deng, Y., et al. (2024). Knowledge Boundary of Large Language Models: A Survey. arXiv:2412.12472. https://arxiv.org/abs/2412.12472
Ren, R., Wang, Y., Qu, Y., et al. (2023). Investigating the Factual Knowledge Boundary of Large Language Models with Retrieval Augmentation. arXiv:2307.11019. https://arxiv.org/abs/2307.11019
Kadavath, S., et al. (2022). Language Models (Mostly) Know What They Know. arXiv:2207.05221. https://arxiv.org/abs/2207.05221
Kamath, A., Jia, R., & Liang, P. (2020). Selective Question Answering under Domain Shift. ACL 2020.
Xu, J., et al. (2025). UProp: Investigating the Uncertainty Propagation of LLMs in Multi-Step Agentic Decision-Making. arXiv:2506.17419. https://arxiv.org/abs/2506.17419
Min, S., Krishna, K., Lyu, X., et al. (2023). FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. EMNLP 2023. arXiv:2305.14251. https://arxiv.org/abs/2305.14251
Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv:2309.15217. https://arxiv.org/abs/2309.15217
Hintsanen, A. (2026). Hallucination as Output-Boundary Misclassification: A Composite Abstention Architecture for Language Models. arXiv:2604.06195.
Emanuilov, S., & Ackermann, R. (2025). Stemming Hallucination in Language Models Using a Licensing Oracle. arXiv:2511.06073.
Romanchuk, O., & Bondar, R. (2026). Semantic Laundering in AI Agent Architectures: Why Tool Boundaries Do Not Confer Epistemic Warrant. arXiv:2601.08333.
Imperial, J. M., & Madabushi, H. T. (2026). Safer Policy Compliance with Dynamic Epistemic Fallback. arXiv:2601.23094.
Marchal, N., Chan, S., Franklin, M., et al. (2026). Architecting Trust in Artificial Epistemic Agents. arXiv:2603.02960.
Kumar, R. (2026). Epistemic Restraint by Design: A Position Paper on Boundary-Aware AI Systems. Position paper. PDF
