For: AI engineers, ML engineers, platform engineers, AI security engineers, enterprise architects

Design Patterns for SLM-First Systems

Implementation patterns for routers, validators, and loop controllers with LangGraph

#small language models #SLM #LLM #LangGraph #agentic AI #AI architecture #design patterns #routers #validators #loop controllers

The previous article in this series established why SLMs belong in production AI systems as infrastructure - not as smaller GPTs, but as composable intelligence units that handle routing, classification, transformation, and loop control. If you haven't read it, start there: Small Language Models Are Not Smaller GPTs - They're Infrastructure.

This article is about the how. Specifically: what do these patterns look like in code? How do you wire a router SLM into a LangGraph workflow? What does a validator node actually check? How do you build a loop controller that escalates correctly without infinite-looping into bankruptcy?

Every pattern here has been distilled from production systems. The failure modes are real. The code is runnable. By the end you'll have a composable pipeline where each SLM handles exactly one decision type and every path is observable, testable, and fallback-protected.

Six patterns are covered: Router - Schema Enforcer - Output Validator - Loop Controller - Consequence Modeler - Context Pruner. Skip ahead if you're here for a specific one.


Setup and Shared Infrastructure

Before patterns, the scaffolding. All examples use LangGraph - a framework for building stateful, multi-actor agentic workflows as directed graphs. Each node in the graph is a Python function that reads from and writes to shared state. Edges define the execution flow. Conditional edges let you branch based on state values - this is how the SLM's routing decision becomes a graph transition. All examples share the same base setup.

code
from langchain_openai import ChatOpenAI
from langchain_community.llms import Ollama
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import MemorySaver
from typing import TypedDict, Literal, Annotated
import operator
import json

# Frontier LLM - used only for reasoning steps
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# SLM - used for routing, validation, classification
# Run locally via Ollama, or swap for any small model API
slm = Ollama(model="phi3:mini", temperature=0)

# Shared state schema - every graph in this article uses this base
class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    route: str
    confidence: float
    intent: dict
    loop_count: int
    last_action_result: str
    escalate: bool

Two things worth calling out. First, temperature=0 on both models. Routing and validation decisions must be deterministic. Any non-zero temperature introduces variance into your control layer, which means the same input might route differently on different calls. That's a silent failure mode that's extremely hard to debug.

Second, loop_count in state. Every agentic system needs a circuit breaker. You will add this if you don't build it in upfront - the only question is whether you add it before or after your first runaway loop incident.


Pattern 1 - The SLM Router

The router is the entry point of every SLM-first system. It classifies incoming requests and assigns execution paths. The LLM never sees requests that the router can handle or redirect to simpler paths.

The Wrong Way

code
# Don't do this
def route_with_llm(state: AgentState) -> AgentState:
    response = llm.invoke(
        f"Classify this request: {state['messages'][-1].content}\n"
        "Return one of: simple_lookup, tool_call, complex_reasoning"
    )
    state["route"] = response.content.strip()
    return state

This sends every request to a frontier model to decide which requests need a frontier model. The cost is 100x higher than necessary. The latency adds 500ms-1s before any real work starts. And because frontier models are verbose, you'll spend tokens on the classification explanation, not just the classification.

The Right Way

code
ROUTER_PROMPT = """You are a request router. Classify the request into exactly one category.

Categories:
- simple_lookup: factual questions, status checks, definitions
- tool_call: actions requiring an API or database call
- complex_reasoning: multi-step analysis, synthesis, judgment calls
- clarification_needed: ambiguous or incomplete requests

Request: {request}

Respond with JSON only:
{{"route": "<category>", "confidence": <0.0-1.0>, "reason": "<one sentence>"}}
"""

def slm_router(state: AgentState) -> AgentState:
    request = state["messages"][-1].content
    prompt = ROUTER_PROMPT.format(request=request)
    raw = slm.invoke(prompt)
    try:
        parsed = json.loads(raw.strip())
        state["route"] = parsed["route"]
        state["confidence"] = parsed["confidence"]
    except (json.JSONDecodeError, KeyError):
        # Parsing failure -> safe default is escalation, not silent wrong routing
        state["route"] = "complex_reasoning"
        state["confidence"] = 0.0
    return state

def route_decision(state: AgentState) -> str:
    # Low confidence escalates to LLM regardless of route label
    if state["confidence"] < 0.7:
        return "llm_handler"
    route_map = {
        "simple_lookup": "direct_response",
        "tool_call": "tool_executor",
        "complex_reasoning": "llm_handler",
        "clarification_needed": "clarification_handler",
    }
    return route_map.get(state["route"], "llm_handler")

Three things in this pattern that matter:

Structured output. The SLM returns JSON, not a sentence. Natural language classification responses are brittle - "I think this is a tool call" versus "tool_call" require different parsing logic and the former will break. Force JSON. Validate it. If parsing fails, don't silently default to any route - escalate.

Confidence threshold. The SLM labels every decision with a confidence score. Anything below 0.7 escalates to the LLM regardless of the route label. This single mechanism catches most distribution shift failures automatically - the SLM doesn't know what it doesn't know, but it does know when it's uncertain.

Graceful degradation. The except block defaults to complex_reasoning + confidence 0.0, which routes to the LLM handler. This means a router failure produces a slower, more expensive response - not a crash, not a silent wrong routing.
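Since every pattern in this article repeats the parse-or-escalate step, it can be centralized. The helper below is hypothetical, not part of the pattern code above; it assumes the common failure mode where a small model wraps its JSON in markdown fences despite instructions:

```python
import json

def parse_slm_json(raw: str, fallback: dict) -> dict:
    """Parse SLM output as JSON, tolerating markdown code fences; fall back to escalation on failure."""
    text = raw.strip()
    # Small models sometimes wrap JSON in ```json ... ``` fences despite instructions
    if text.startswith("```"):
        text = text.strip("`")
        if text.startswith("json"):
            text = text[4:]
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError:
        return dict(fallback)

# Fallback mirrors the router's safe default: escalate, don't guess
fallback = {"route": "complex_reasoning", "confidence": 0.0}
print(parse_slm_json('```json\n{"route": "tool_call", "confidence": 0.9}\n```', fallback))
print(parse_slm_json("not json at all", fallback))
```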

Wiring into LangGraph

code
def build_routing_graph():
    graph = StateGraph(AgentState)
    graph.add_node("router", slm_router)
    graph.add_node("direct_response", direct_response_node)
    graph.add_node("tool_executor", tool_executor_node)
    graph.add_node("llm_handler", llm_handler_node)
    graph.add_node("clarification_handler", clarification_handler_node)
    graph.set_entry_point("router")
    graph.add_conditional_edges(
        "router",
        route_decision,
        {
            "direct_response": "direct_response",
            "tool_executor": "tool_executor",
            "llm_handler": "llm_handler",
            "clarification_handler": "clarification_handler",
        }
    )
    graph.add_edge("direct_response", END)
    graph.add_edge("tool_executor", END)
    graph.add_edge("llm_handler", END)
    graph.add_edge("clarification_handler", END)
    return graph.compile(checkpointer=MemorySaver())

Every node that produces a terminal response routes to END. The checkpointer=MemorySaver() persists state across turns - without this, multi-turn conversations lose routing history, which means you can't detect when a user's request type changes across turns.


Pattern 2 - The Schema Enforcer

The schema enforcer transforms natural language into typed, structured payloads. The pattern is: SLM extracts intent and parameters, validates the output, and hands structured data to your downstream service. The downstream service never touches natural language.

code
SCHEMA_ENFORCER_PROMPT = """Extract structured data from the user request.

Target schema:
{schema}

User request: {request}

Rules:
- Return valid JSON matching the schema exactly
- Use null for missing optional fields
- If a required field cannot be determined, set "extraction_failed": true
- Do not invent values - only extract what is explicitly stated

Respond with JSON only."""

def make_schema_enforcer(schema: dict, schema_name: str):
    """Factory that returns a schema enforcer node for a specific schema."""
    def schema_enforcer_node(state: AgentState) -> AgentState:
        request = state["messages"][-1].content
        prompt = SCHEMA_ENFORCER_PROMPT.format(
            schema=json.dumps(schema, indent=2),
            request=request
        )
        raw = slm.invoke(prompt)
        try:
            extracted = json.loads(raw.strip())
        except json.JSONDecodeError:
            state["intent"] = {"extraction_failed": True, "raw": raw}
            state["escalate"] = True
            return state
        if extracted.get("extraction_failed"):
            state["intent"] = extracted
            state["escalate"] = True
        else:
            state["intent"] = extracted
            state["escalate"] = False
        return state
    schema_enforcer_node.__name__ = f"schema_enforcer_{schema_name}"
    return schema_enforcer_node

# Example: calendar event schema
CALENDAR_SCHEMA = {
    "intent": "string (create_event | update_event | delete_event)",
    "title": "string (required)",
    "start_time": "ISO 8601 datetime (required)",
    "end_time": "ISO 8601 datetime (optional)",
    "attendees": "array of email strings (optional)",
    "location": "string (optional)"
}

calendar_enforcer = make_schema_enforcer(CALENDAR_SCHEMA, "calendar")

The factory pattern matters here. You'll have multiple schemas in a real system - calendar, tickets, orders, search queries. Rather than writing a separate node for each, the factory generates typed enforcer nodes on demand. Each enforcer carries its schema definition in its closure, so the node is self-documenting.

The extraction_failed flag is the pattern's safety valve. If the SLM cannot confidently extract a required field - the user said "book a meeting with Arjun" without specifying a time - the flag triggers escalation. The escalation path can ask for clarification, hand off to the LLM for inference, or return an error to the user. What it should never do is silently pass incomplete data to the downstream service.
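If your schemas follow the informal convention above - required fields marked with "(required)" in their description strings - the safety valve can also be checked mechanically after extraction. A sketch under that assumption, with a hypothetical helper name:

```python
def missing_required_fields(schema: dict, extracted: dict) -> list[str]:
    """Return the required schema fields that the SLM failed to extract."""
    required = [k for k, v in schema.items() if "(required)" in str(v)]
    return [k for k in required if not extracted.get(k)]

CALENDAR_SCHEMA = {
    "title": "string (required)",
    "start_time": "ISO 8601 datetime (required)",
    "location": "string (optional)",
}

# "book a meeting with Arjun" - no time given
extracted = {"title": "Meeting with Arjun", "start_time": None, "location": None}
print(missing_required_fields(CALENDAR_SCHEMA, extracted))  # ['start_time']
```

Any non-empty result should set extraction_failed and trigger the escalation path, even if the SLM itself forgot to set the flag.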

Validation Layer

The schema enforcer extracts data. A separate validator verifies it. These are different jobs.

code
from datetime import datetime

def validate_calendar_intent(state: AgentState) -> AgentState:
    intent = state.get("intent", {})
    errors = []
    # Required field check
    if not intent.get("title"):
        errors.append("title is required")
    if not intent.get("start_time"):
        errors.append("start_time is required")
    # Type/format validation
    if intent.get("start_time"):
        try:
            datetime.fromisoformat(intent["start_time"])
        except ValueError:
            errors.append(f"start_time is not valid ISO 8601: {intent['start_time']}")
    # Business rule validation - guarded so an invalid timestamp fails with
    # a validation error instead of an unhandled exception
    if intent.get("start_time") and intent.get("end_time"):
        try:
            start = datetime.fromisoformat(intent["start_time"])
            end = datetime.fromisoformat(intent["end_time"])
            if end <= start:
                errors.append("end_time must be after start_time")
        except ValueError:
            errors.append(f"end_time is not valid ISO 8601: {intent['end_time']}")
    if errors:
        state["last_action_result"] = f"Validation failed: {'; '.join(errors)}"
        state["escalate"] = True
    else:
        state["escalate"] = False
    return state

Validation catches two different failure categories. SLM extraction errors - wrong format, hallucinated values, incomplete fields. And business rule violations - end time before start time, past dates, invalid combinations. Both need to fail before the downstream API call, not after.


Pattern 3 - The Output Validator

The output validator runs after the LLM generates a response, before it reaches the user. It's the last line of defense against PII leakage, schema violations, safety issues, and length anomalies.

code
from langchain_core.messages import AIMessage

OUTPUT_VALIDATOR_PROMPT = """You are an output safety validator. Review the AI response below.

Check for:
1. PII exposure (names + contact info combined, SSNs, credit card numbers, passwords)
2. Schema compliance: does the response match the expected format? Expected: {expected_format}
3. Hallucination signals: confident specific claims without basis (exact numbers, dates, names not in context)
4. Toxicity or policy violations

Response to validate:
{response}

Return JSON:
{{"pass": true/false, "issues": ["issue1", "issue2"], "severity": "low|medium|high|critical"}}

If pass is false and severity is critical, the response must be blocked entirely."""

def output_validator(state: AgentState, expected_format: str = "natural language") -> AgentState:
    last_message = state["messages"][-1].content
    prompt = OUTPUT_VALIDATOR_PROMPT.format(
        response=last_message,
        expected_format=expected_format
    )
    raw = slm.invoke(prompt)
    try:
        result = json.loads(raw.strip())
    except json.JSONDecodeError:
        # Validator itself failed - safe default is to flag for review
        state["last_action_result"] = "validator_error"
        state["escalate"] = True
        return state
    if not result.get("pass", True):
        severity = result.get("severity", "medium")
        if severity == "critical":
            # Block the response entirely, return safe error message
            state["messages"].append(AIMessage(
                content="I encountered an issue generating a safe response. Please try rephrasing your request."
            ))
            state["escalate"] = True
        elif severity in ("high", "medium"):
            # Flag for human review, still return response
            state["last_action_result"] = f"flagged:{severity}:{','.join(result.get('issues', []))}"
    return state

The severity tiering is important. Not all validation failures warrant blocking the response. A slightly long response is low. A response that mentions someone's name alongside their phone number is critical. Blocking everything that fails any check creates a system that's too conservative to be useful. Tiering lets you tune the response: critical issues get blocked, medium issues get flagged for review, low issues get logged.
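The tiering policy can be made explicit as a lookup table rather than scattered if/elif branches. A sketch - the mapping shown is one reasonable policy, not the only one:

```python
from enum import Enum

class Action(Enum):
    BLOCK = "block"  # replace the response with a safe error message
    FLAG = "flag"    # deliver the response, queue it for human review
    LOG = "log"      # deliver the response, record the issue

SEVERITY_ACTIONS = {
    "critical": Action.BLOCK,
    "high": Action.FLAG,
    "medium": Action.FLAG,
    "low": Action.LOG,
}

def action_for(severity: str) -> Action:
    # Unknown severities flag for review rather than silently passing
    return SEVERITY_ACTIONS.get(severity, Action.FLAG)

print(action_for("critical"))  # Action.BLOCK
print(action_for("weird"))     # Action.FLAG
```

Keeping the policy in one table makes it auditable and trivially testable, which matters once security review asks exactly when responses get blocked.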


Pattern 4 - The Loop Controller

The loop controller is the most consequential SLM pattern in agentic systems. It makes the continue/terminate/escalate decision at every iteration of the agentic loop. Getting this wrong means either premature termination (tasks not completed) or runaway loops (tasks running forever, burning money, potentially executing destructive actions repeatedly).

code
LOOP_CONTROLLER_PROMPT = """You are a loop controller for an agentic system. Evaluate the current state.

Task: {task}
Loop iteration: {loop_count} of {max_loops}
Last action result: {last_result}
Conversation so far: {history}

Decide the next action:
- CONTINUE: task not complete, last action succeeded, safe to proceed
- COMPLETE: task is finished, return result to user
- REPLAN: task not complete but current approach is failing, need new strategy
- ESCALATE: situation requires human judgment or LLM reasoning
- ABORT: task cannot be completed, too many failures, or approaching loop limit

Rules:
- If loop_count >= {max_loops} * 0.8, prefer COMPLETE or ABORT over CONTINUE
- If last action failed twice in a row, prefer REPLAN or ESCALATE
- Never CONTINUE if last_result contains an unhandled error

Return JSON:
{{"decision": "<CONTINUE|COMPLETE|REPLAN|ESCALATE|ABORT>",
  "reason": "<one sentence>",
  "confidence": <0.0-1.0>}}
"""

MAX_LOOPS = 10
ESCALATE_THRESHOLD = 0.65

def loop_controller(state: AgentState) -> AgentState:
    loop_count = state.get("loop_count", 0) + 1
    state["loop_count"] = loop_count
    # Hard circuit breaker - never trust the SLM to self-limit
    if loop_count >= MAX_LOOPS:
        state["route"] = "ABORT"
        return state
    task = state["messages"][0].content
    last_result = state.get("last_action_result", "no action yet")
    history = "\n".join([
        f"{m.type}: {m.content[:200]}"
        for m in state["messages"][-6:]  # Last 3 turns
    ])
    prompt = LOOP_CONTROLLER_PROMPT.format(
        task=task,
        loop_count=loop_count,
        max_loops=MAX_LOOPS,
        last_result=last_result,
        history=history
    )
    raw = slm.invoke(prompt)
    try:
        result = json.loads(raw.strip())
        decision = result["decision"]
        confidence = result.get("confidence", 0.5)
    except (json.JSONDecodeError, KeyError):
        # Controller parse failure - safe default is ESCALATE
        state["route"] = "ESCALATE"
        return state
    # Low confidence decisions escalate regardless of label
    if confidence < ESCALATE_THRESHOLD and decision not in ("COMPLETE", "ABORT"):
        decision = "ESCALATE"
    state["route"] = decision
    return state

def loop_decision(state: AgentState) -> str:
    decision_map = {
        "CONTINUE": "action_executor",
        "COMPLETE": "response_formatter",
        "REPLAN": "llm_planner",
        "ESCALATE": "llm_handler",
        "ABORT": "abort_handler",
    }
    return decision_map.get(state["route"], "llm_handler")

The hard circuit breaker at the top of loop_controller is non-negotiable. The SLM's prompt tells it to prefer termination near the loop limit, but that instruction competes with its training to complete tasks. A task-completion bias in the SLM can override soft guidance. The hard check doesn't ask the SLM - it just stops the loop. Put this check first, before the SLM call.

The 80% threshold in the prompt passes MAX_LOOPS as a format variable, so the SLM receives a concrete number ("8 of 10") rather than an expression. This produces more graceful terminations - the SLM starts preferring COMPLETE or ABORT at iteration 8, rather than hitting the hard stop at 10 and returning whatever partial state exists.
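The wind-down arithmetic is worth making concrete. A sketch, assuming the same MAX_LOOPS = 10 as above:

```python
MAX_LOOPS = 10

def in_wind_down(loop_count: int, max_loops: int = MAX_LOOPS) -> bool:
    """True once the loop enters the soft-termination zone (last 20% of the budget)."""
    return loop_count >= max_loops * 0.8

# With MAX_LOOPS = 10, the SLM is told to start wrapping up at iteration 8
print([n for n in range(1, 11) if in_wind_down(n)])  # [8, 9, 10]
```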

Full Agentic Loop Graph

code
def build_agentic_loop_graph():
    graph = StateGraph(AgentState)
    # Nodes
    graph.add_node("loop_controller", loop_controller)
    graph.add_node("action_executor", action_executor_node)
    graph.add_node("response_formatter", response_formatter_node)
    graph.add_node("llm_planner", llm_planner_node)
    graph.add_node("llm_handler", llm_handler_node)
    graph.add_node("abort_handler", abort_handler_node)
    # Entry
    graph.set_entry_point("loop_controller")
    # Loop controller routes
    graph.add_conditional_edges(
        "loop_controller",
        loop_decision,
        {
            "action_executor": "action_executor",
            "response_formatter": "response_formatter",
            "llm_planner": "llm_planner",
            "llm_handler": "llm_handler",
            "abort_handler": "abort_handler",
        }
    )
    # Action executor feeds back into the loop controller
    graph.add_edge("action_executor", "loop_controller")
    # LLM planner feeds back into the loop controller with updated plan
    graph.add_edge("llm_planner", "loop_controller")
    # Terminal nodes
    graph.add_edge("response_formatter", END)
    graph.add_edge("llm_handler", END)
    graph.add_edge("abort_handler", END)
    return graph.compile(checkpointer=MemorySaver())

The critical edges are action_executor -> loop_controller and llm_planner -> loop_controller. These are the feedback loops. Every completed action returns to the controller for evaluation before the next action executes. This is what makes the loop controlled rather than runaway - every iteration requires the controller to explicitly authorize continuation.

action_executor_node is deliberately domain-specific - it's where your actual tool dispatch lives. Here's the stub:

code
from langchain_core.messages import AIMessage

def action_executor_node(state: AgentState) -> AgentState:
    """
    Executes the next action in the agentic loop.
    Replace this stub with your tool dispatch logic.

    Contract:
    - Read the current plan from state["messages"]
    - Execute one action (one tool call, one API call, one write operation)
    - Write the result to state["last_action_result"]
    - Append result as AIMessage to state["messages"]
    - Never execute more than one action per call - the loop controller decides whether to continue
    """
    # Example: extract action from last LLM plan message
    plan = state["messages"][-1].content
    # Replace with your actual tool dispatch
    # e.g. result = your_tool_registry.execute(plan)
    result = f"Executed action from plan: {plan[:100]}"
    state["last_action_result"] = result
    state["messages"] = [AIMessage(content=result)]
    return state

The docstring is the contract. One action per call is the invariant. The loop controller decides continuation - the executor should never decide to keep going on its own.

Human-in-the-Loop with interrupt_before

When the consequence modeler flags an action as requiring human approval, you need the graph to pause and wait - not abort, not escalate to the LLM, but actually suspend and hand control back to a human. LangGraph's interrupt_before mechanism handles this:

code
from langgraph.checkpoint.memory import MemorySaver
from langgraph.types import interrupt

def human_approval_node(state: AgentState) -> AgentState:
    """
    Pauses the graph and surfaces the pending action for human review.
    The graph resumes when the caller invokes graph.invoke() again
    with the same thread_id and an updated state.
    """
    pending_action = state.get("intent", {})
    blocked_reason = state.get("last_action_result", "requires approval")
    # interrupt() pauses execution here and returns control to the caller
    # The caller sees the graph in an "interrupted" state and can:
    # 1. Approve: resume with state["escalate"] = False
    # 2. Reject: resume with state["escalate"] = True and a rejection message
    human_decision = interrupt({
        "pending_action": pending_action,
        "reason": blocked_reason,
        "message": "This action requires your approval before execution."
    })
    # human_decision is whatever the caller passes when resuming
    if human_decision.get("approved"):
        state["escalate"] = False
        state["last_action_result"] = "approved_by_human"
    else:
        state["escalate"] = True
        state["last_action_result"] = f"rejected_by_human:{human_decision.get('reason', '')}"
    return state

def build_agentic_loop_with_approval():
    graph = StateGraph(AgentState)
    graph.add_node("loop_controller", loop_controller)
    graph.add_node("action_executor", action_executor_node)
    graph.add_node("consequence_check", consequence_check_node)
    graph.add_node("human_approval", human_approval_node)
    graph.add_node("response_formatter", response_formatter_node)
    graph.add_node("abort_handler", abort_handler_node)
    graph.add_node("llm_handler", llm_handler_node)
    graph.set_entry_point("loop_controller")
    graph.add_conditional_edges(
        "loop_controller",
        loop_decision,
        {
            "action_executor": "consequence_check",  # Check before every execution
            "response_formatter": "response_formatter",
            "llm_planner": "llm_handler",
            "llm_handler": "llm_handler",
            "abort_handler": "abort_handler",
        }
    )
    # Consequence check gates every action
    graph.add_conditional_edges(
        "consequence_check",
        lambda s: "human_approval" if s.get("escalate") else "action_executor",
        {
            "human_approval": "human_approval",
            "action_executor": "action_executor",
        }
    )
    # After human decision, either execute or return to loop
    graph.add_conditional_edges(
        "human_approval",
        lambda s: "action_executor" if not s.get("escalate") else "loop_controller",
        {
            "action_executor": "action_executor",
            "loop_controller": "loop_controller",
        }
    )
    graph.add_edge("action_executor", "loop_controller")
    graph.add_edge("response_formatter", END)
    graph.add_edge("llm_handler", END)
    graph.add_edge("abort_handler", END)
    # interrupt_before tells LangGraph to pause BEFORE entering human_approval
    # This surfaces the pending decision to the caller without executing the node
    return graph.compile(
        checkpointer=MemorySaver(),
        interrupt_before=["human_approval"]
    )

The caller workflow:

code
from langgraph.types import Command

config = {"configurable": {"thread_id": "task-001"}}
graph = build_agentic_loop_with_approval()

# Start the task - runs until it hits human_approval
result = graph.invoke(initial_state, config=config)

# Graph is now paused. Inspect the pending action:
state = graph.get_state(config)
pending = state.values.get("intent")
print(f"Pending action requiring approval: {pending}")

# Human approves - resume; the Command payload becomes interrupt()'s return value
graph.invoke(Command(resume={"approved": True}), config=config)

# Human rejects - resume with rejection reason
graph.invoke(
    Command(resume={"approved": False, "reason": "Budget not authorized"}),
    config=config
)

interrupt_before and the dynamic interrupt() call are LangGraph's native pause mechanisms. The graph state is fully persisted at the pause point - you can close the process, restart it, and resume from exactly where it stopped, as long as you use the same thread_id and a persistent checkpointer (swap MemorySaver for SqliteSaver or PostgresSaver in production).


Pattern 5 - The Consequence Modeler

The consequence modeler runs before irreversible or write actions. It classifies whether an action is safe to execute, safe to retry, requires human approval, or should be blocked. If you have agentic loops executing write operations, you need this pattern.

code
CONSEQUENCE_MODELER_PROMPT = """You are a consequence modeler for an agentic system.
Classify the proposed action before it executes.

Proposed action: {action}
Action parameters: {parameters}
System context: {context}

Classification dimensions:
1. Reversibility: can this action be undone? (reversible | partially_reversible | irreversible)
2. Idempotency: is this action safe to retry? (idempotent | non_idempotent)
3. Blast radius: what breaks if this fails or produces wrong output? (isolated | service_level | system_wide)
4. Data sensitivity: does this action touch sensitive data? (none | low | high | critical)

Return JSON:
{{
  "reversibility": "<reversible|partially_reversible|irreversible>",
  "idempotency": "<idempotent|non_idempotent>",
  "blast_radius": "<isolated|service_level|system_wide>",
  "data_sensitivity": "<none|low|high|critical>",
  "requires_approval": <true|false>,
  "safe_to_execute": <true|false>,
  "reason": "<one sentence>"
}}"""

def consequence_modeler(state: AgentState, action: str, parameters: dict) -> dict:
    context = {
        "loop_count": state.get("loop_count", 0),
        "previous_actions": state.get("messages", [])[-4:],
    }
    prompt = CONSEQUENCE_MODELER_PROMPT.format(
        action=action,
        parameters=json.dumps(parameters),
        context=json.dumps(context, default=str)
    )
    raw = slm.invoke(prompt)
    try:
        result = json.loads(raw.strip())
    except json.JSONDecodeError:
        # Consequence modeler failure - block the action, don't guess
        return {
            "safe_to_execute": False,
            "requires_approval": True,
            "reason": "Consequence modeler failed to parse - blocking for safety"
        }
    # Override: irreversible + non_idempotent always requires approval
    if (result.get("reversibility") == "irreversible" and
            result.get("idempotency") == "non_idempotent"):
        result["requires_approval"] = True
        result["safe_to_execute"] = False
    # Override: system_wide blast radius always requires approval
    if result.get("blast_radius") == "system_wide":
        result["requires_approval"] = True
    return result

The two override blocks at the bottom are deliberate. They enforce invariants that the SLM's probabilistic reasoning should never override: an irreversible, non-idempotent action always requires human approval, and a system-wide blast radius always requires human approval. These are not recommendations - they're hard rules. The SLM classifies; the overrides enforce the policy.

Idempotency Key Integration

The consequence modeler's idempotency classification feeds directly into how actions execute:

code
import hashlib

def execute_with_idempotency(
    action_fn,
    parameters: dict,
    consequence: dict,
    state: AgentState
) -> dict:
    """Execute an action with idempotency key if classified as non-idempotent."""
    if consequence.get("idempotency") == "non_idempotent":
        # Generate deterministic idempotency key from action + params + turn
        key_source = json.dumps({
            "params": parameters,
            "loop_count": state.get("loop_count"),
            "message_id": len(state.get("messages", []))
        }, sort_keys=True)
        idempotency_key = hashlib.sha256(key_source.encode()).hexdigest()[:32]
        return action_fn(**parameters, idempotency_key=idempotency_key)
    else:
        return action_fn(**parameters)

The idempotency key is deterministic - it's derived from the action parameters and position in the conversation. This means if the loop controller retries the same action at the same position (same loop count, same message count), it generates the same key. The downstream service sees the retry as a duplicate and returns the original result without re-executing. The double-charge problem is solved at the execution layer, not by hoping the loop controller doesn't retry.
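The determinism claim is easy to verify. This sketch mirrors the key derivation from execute_with_idempotency as a standalone function:

```python
import hashlib
import json

def idempotency_key(parameters: dict, loop_count: int, message_count: int) -> str:
    """Deterministic key from action parameters plus position in the conversation."""
    key_source = json.dumps({
        "params": parameters,
        "loop_count": loop_count,
        "message_id": message_count,
    }, sort_keys=True)
    return hashlib.sha256(key_source.encode()).hexdigest()[:32]

params = {"amount": 49.99, "currency": "USD"}

# Same action at the same loop position -> same key -> downstream dedupes the retry
assert idempotency_key(params, 3, 5) == idempotency_key(params, 3, 5)

# The same action at a later position is a new, intended execution -> new key
assert idempotency_key(params, 4, 6) != idempotency_key(params, 3, 5)
```

Note that sort_keys=True matters: without it, two dicts with the same contents but different insertion order could serialize differently and defeat the deduplication.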


Pattern 6 - The Context Pruner

The context pruner sits between your retrieval layer and your LLM. It reduces retrieved chunks to the highest-signal subset. The most common implementation uses a cross-encoder reranker - a small model purpose-built to score (query, passage) relevance pairs.

Note on terminology: In RAG pipelines you'll often see this called a "reranker" rather than a "context pruner." They're the same role. A cross-encoder like cross-encoder/ms-marco-MiniLM-L-6-v2 scores and ranks chunks. A general SLM pruner can also compress, deduplicate, and restructure. The cross-encoder is the most common and most efficient implementation for pure reranking tasks.

code
from sentence_transformers import CrossEncoder

# Dedicated cross-encoder reranker - faster and more accurate than
# a general SLM for pure relevance scoring
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def context_pruner_node(state: AgentState, retrieved_chunks: list[str]) -> AgentState:
    query = state["messages"][-1].content
    if len(retrieved_chunks) <= 3:
        # Small enough to pass through directly
        state["intent"]["context"] = retrieved_chunks
        return state
    # Score all (query, chunk) pairs
    pairs = [(query, chunk) for chunk in retrieved_chunks]
    scores = reranker.predict(pairs)
    # Rank and keep top-k
    ranked = sorted(
        zip(scores, retrieved_chunks),
        key=lambda x: x[0],
        reverse=True
    )
    TOP_K = 4
    MIN_RELEVANCE_SCORE = 0.1  # Chunks below this are noise regardless of rank
    filtered = [
        chunk for score, chunk in ranked[:TOP_K]
        if score > MIN_RELEVANCE_SCORE
    ]
    if not filtered:
        # All chunks below threshold - better to have the LLM say
        # "I don't have enough context" than to hallucinate from noise
        state["intent"]["context"] = []
        state["intent"]["context_insufficient"] = True
    else:
        state["intent"]["context"] = filtered
        state["intent"]["context_insufficient"] = False
    return state

The MIN_RELEVANCE_SCORE threshold is the pattern's most important parameter. Without it, the pruner always returns TOP_K chunks even when all of them are irrelevant to the query. A cross-encoder score below 0.1 means the chunk has essentially no relevance - passing it to the LLM adds noise, not signal, and increases hallucination risk. An empty context that triggers an "I don't have enough information" response is more honest and more useful than four irrelevant chunks that push the LLM toward a confident wrong answer.
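The interaction between TOP_K and the relevance floor is easy to see with mock scores - no model needed, the scores below are fabricated for illustration:

```python
TOP_K = 4
MIN_RELEVANCE_SCORE = 0.1

def prune(scored_chunks: list[tuple[float, str]]) -> list[str]:
    """Keep the top-k chunks by score, then drop anything below the relevance floor."""
    ranked = sorted(scored_chunks, key=lambda x: x[0], reverse=True)
    return [chunk for score, chunk in ranked[:TOP_K] if score > MIN_RELEVANCE_SCORE]

# A query where retrieval returned mostly noise
scored = [(0.02, "a"), (0.85, "b"), (0.04, "c"), (0.31, "d"), (0.01, "e")]
print(prune(scored))  # ['b', 'd'] - the floor dropped two of the top-4 chunks

# All noise -> empty context, which triggers the context_insufficient path
print(prune([(0.03, "x"), (0.05, "y")]))  # []
```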

When to Use a General SLM Instead

Use a cross-encoder for pure reranking. Use a general SLM pruner when you need to do more than rank:

code
SLM_PRUNER_PROMPT = """Given the query and retrieved chunks, produce a pruned context.

Query: {query}

Retrieved chunks:
{chunks}

Tasks:
1. Remove chunks with no relevance to the query
2. Remove duplicate information across chunks
3. If two chunks say the same thing, keep the more specific one
4. Preserve all unique facts relevant to the query

Return a JSON array of the chunks to keep, in order of relevance.
Do not modify chunk text. Return chunk indices only: [0, 3, 5]"""

def slm_context_pruner_node(state: AgentState, retrieved_chunks: list[str]) -> AgentState:
    query = state["messages"][-1].content
    chunks_formatted = "\n\n".join([
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks)
    ])
    prompt = SLM_PRUNER_PROMPT.format(
        query=query,
        chunks=chunks_formatted
    )
    raw = slm.invoke(prompt)
    try:
        indices = json.loads(raw.strip())
        # Guard against negative, out-of-range, or non-integer indices
        pruned = [
            retrieved_chunks[i] for i in indices
            if isinstance(i, int) and 0 <= i < len(retrieved_chunks)
        ]
    except (json.JSONDecodeError, TypeError):
        # Pruner failure - fall back to first 4 chunks rather than all chunks
        pruned = retrieved_chunks[:4]
    state["intent"]["context"] = pruned
    return state

Composing Patterns - A Complete Pipeline

Individual patterns are useful. The real value comes from composition. Here's a complete customer support pipeline that chains router, schema enforcer, consequence modeler, and output validator.

Before the graph, the node stubs that complete the pipeline. These are the terminal and utility nodes referenced throughout:

code
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage

def direct_lookup_node(state: AgentState) -> AgentState:
    """Handles simple factual lookups without LLM - stub for your lookup logic."""
    query = state["messages"][-1].content
    # Replace with your actual lookup logic (DB query, cache hit, etc.)
    result = f"Direct lookup result for: {query}"
    state["messages"] = [AIMessage(content=result)]
    state["last_action_result"] = "success"
    return state

def tool_executor_node(state: AgentState) -> AgentState:
    """Executes a validated, consequence-checked tool call."""
    intent = state.get("intent", {})
    # Replace with your actual tool dispatch logic
    state["last_action_result"] = f"Tool executed: {intent.get('intent', 'unknown')}"
    state["messages"] = [AIMessage(content=f"Action completed: {state['last_action_result']}")]
    return state

def llm_handler_node(state: AgentState) -> AgentState:
    """Sends request to the frontier LLM for reasoning tasks."""
    context = state.get("intent", {}).get("context", [])
    context_str = "\n\n".join(context) if context else ""
    messages = state["messages"]
    if context_str:
        messages = [SystemMessage(content=f"Context:\n{context_str}")] + messages
    response = llm.invoke(messages)
    state["messages"] = [response]
    state["last_action_result"] = "llm_response"
    return state

def llm_planner_node(state: AgentState) -> AgentState:
    """Re-plans when the loop controller signals REPLAN."""
    task = state["messages"][0].content
    last_result = state.get("last_action_result", "")
    plan_prompt = (
        f"Task: {task}\n"
        f"Previous approach failed: {last_result}\n"
        "Devise a different approach. Be specific about the next action to take."
    )
    response = llm.invoke(plan_prompt)
    state["messages"] = [response]
    state["last_action_result"] = "replanned"
    return state

def response_formatter_node(state: AgentState) -> AgentState:
    """Formats the final response when the loop controller signals COMPLETE."""
    # Last message already contains the result - clean it up if needed
    state["last_action_result"] = "complete"
    return state

def abort_handler_node(state: AgentState) -> AgentState:
    """Returns a safe error when the loop hits the hard circuit breaker."""
    loop_count = state.get("loop_count", 0)
    state["messages"] = [AIMessage(
        content=(
            f"The task could not be completed after {loop_count} attempts. "
            "Please try rephrasing your request or contact support."
        )
    )]
    state["last_action_result"] = "aborted"
    return state

def clarification_handler_node(state: AgentState) -> AgentState:
    """Asks the user for clarification when intent is ambiguous."""
    state["messages"] = [AIMessage(
        content="I need a bit more information to help you. Could you clarify what you're trying to accomplish?"
    )]
    return state

def consequence_check_node(state: AgentState) -> AgentState:
    """Wrapper that runs consequence_modeler against the current intent."""
    intent = state.get("intent", {})
    action = intent.get("intent", "unknown_action")
    parameters = {k: v for k, v in intent.items() if k != "intent"}
    consequence = consequence_modeler(state, action, parameters)
    if not consequence.get("safe_to_execute", False) or consequence.get("requires_approval", False):
        state["escalate"] = True
        state["last_action_result"] = f"blocked:{consequence.get('reason', 'requires approval')}"
    else:
        state["escalate"] = False
    return state

def context_pruner_wrapper(state: AgentState) -> AgentState:
    """Retrieves and prunes context before passing to LLM handler."""
    from langchain_chroma import Chroma
    from langchain_openai import OpenAIEmbeddings

    query = state["messages"][-1].content
    # Concrete retrieval example using Chroma - swap for your vector DB
    # (FAISS: use FAISS.load_local(...).similarity_search(query, k=10))
    # (Pinecone: use index.query(vector=embed(query), top_k=10))
    vectorstore = Chroma(
        collection_name="knowledge_base",
        embedding_function=OpenAIEmbeddings(),
        persist_directory="./chroma_db"
    )
    docs = vectorstore.similarity_search(query, k=10)
    retrieved_chunks = [doc.page_content for doc in docs]
    return context_pruner_node(state, retrieved_chunks)

Here's the full pipeline architecture before the code:

mermaid
graph TD
    A[User Input] --> B{SLM Router}
    B -->|simple_lookup| C[Direct Lookup]
    B -->|tool_call| D[Schema Enforcer SLM]
    B -->|complex_reasoning| E[Context Pruner]
    B -->|clarification_needed| K[Output Validator]

    D --> F[Intent Validator]
    F -->|valid| G[Consequence Check SLM]
    F -->|invalid| J[LLM Handler]
    G -->|safe| H[Tool Executor]
    G -->|blocked| J[LLM Handler]

    E --> J
    H --> K[Output Validator]
    J --> K
    C --> K
    K --> L[Response to User]
code
def build_support_pipeline():
    graph = StateGraph(AgentState)

    # Stage 1: Route the request
    graph.add_node("router", slm_router)

    # Stage 2a: Simple lookup path (no LLM)
    graph.add_node("direct_lookup", direct_lookup_node)

    # Stage 2b: Tool execution path
    graph.add_node("schema_enforcer", calendar_enforcer)
    graph.add_node("intent_validator", validate_calendar_intent)
    graph.add_node("consequence_check", consequence_check_node)
    graph.add_node("tool_executor", tool_executor_node)

    # Stage 2c: Complex reasoning path
    graph.add_node("context_pruner", context_pruner_wrapper)
    graph.add_node("llm_handler", llm_handler_node)

    # Stage 3: Output validation (all paths)
    graph.add_node("output_validator", output_validator)

    # Entry
    graph.set_entry_point("router")

    # Router branches
    graph.add_conditional_edges(
        "router",
        route_decision,
        {
            "direct_response": "direct_lookup",
            "tool_executor": "schema_enforcer",
            "llm_handler": "context_pruner",
            "clarification_handler": "output_validator",  # Return clarification request
        }
    )

    # Tool path: enforce -> validate -> consequence check -> execute
    graph.add_edge("schema_enforcer", "intent_validator")
    graph.add_conditional_edges(
        "intent_validator",
        lambda s: "consequence_check" if not s.get("escalate") else "llm_handler",
        {
            "consequence_check": "consequence_check",
            "llm_handler": "llm_handler",
        }
    )
    graph.add_conditional_edges(
        "consequence_check",
        lambda s: "tool_executor" if not s.get("escalate") else "llm_handler",
        {
            "tool_executor": "tool_executor",
            "llm_handler": "llm_handler",
        }
    )

    # All paths converge at output validator
    graph.add_edge("direct_lookup", "output_validator")
    graph.add_edge("tool_executor", "output_validator")
    graph.add_edge("llm_handler", "output_validator")
    graph.add_edge("context_pruner", "llm_handler")
    graph.add_edge("output_validator", END)

    return graph.compile(checkpointer=MemorySaver())

The convergence at output_validator is intentional. Every path - simple lookup, tool execution, LLM reasoning - runs through the same output validation before returning to the user. This is the single point where PII checks, length validation, and safety filtering happen. One node, universal coverage.
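The output_validator node itself was covered earlier in this article; as a reminder of what that convergence point buys you, here is a minimal standalone sketch of the checks it might run. The PII regexes and the length limit are illustrative assumptions, not production-grade detection:

```python
import re

MAX_RESPONSE_CHARS = 4000  # Illustrative limit - tune for your product

# Hypothetical minimal PII patterns - a real system should use a dedicated PII library
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),    # US SSN format
    re.compile(r"\b\d{13,16}\b"),            # Long digit runs (card-number-like)
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),  # Email addresses
]

def validate_output(text: str) -> tuple[bool, str]:
    """Return (ok, reason) - the shape of checks a converged output validator runs."""
    if not text.strip():
        return False, "empty_response"
    if len(text) > MAX_RESPONSE_CHARS:
        return False, "too_long"
    for pattern in PII_PATTERNS:
        if pattern.search(text):
            return False, "pii_detected"
    return True, "ok"
```

Because every path terminates here, adding a new check (profanity, prompt-injection echoes, policy phrases) is a one-node change that covers the whole pipeline.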


Observability - Logging Every Decision

Every SLM decision in every pattern above must be logged. Not for debugging convenience - for operational necessity. You cannot improve a router you cannot observe. You cannot detect distribution shift in a validator you cannot measure. You cannot tune a loop controller you cannot trace.

code
import structlog
from opentelemetry import trace

logger = structlog.get_logger()
tracer = trace.get_tracer("slm-pipeline")

def instrument_slm_node(node_fn, node_name: str):
    """Decorator that adds structured logging and OpenTelemetry spans to any SLM node."""
    def instrumented(state: AgentState) -> AgentState:
        with tracer.start_as_current_span(f"slm.{node_name}") as span:
            input_summary = state["messages"][-1].content[:100] if state.get("messages") else ""
            span.set_attribute("slm.node", node_name)
            span.set_attribute("slm.input_preview", input_summary)
            span.set_attribute("slm.loop_count", state.get("loop_count", 0))

            result_state = node_fn(state)

            span.set_attribute("slm.route", result_state.get("route", ""))
            span.set_attribute("slm.confidence", result_state.get("confidence", 0.0))
            span.set_attribute("slm.escalate", result_state.get("escalate", False))

            logger.info(
                "slm_decision",
                node=node_name,
                route=result_state.get("route"),
                confidence=result_state.get("confidence"),
                escalate=result_state.get("escalate"),
                loop_count=result_state.get("loop_count"),
                input_preview=input_summary,
            )
            return result_state
    return instrumented

# Apply to router
instrumented_router = instrument_slm_node(slm_router, "router")

The instrument_slm_node decorator wraps any SLM node with an OpenTelemetry span and a structured log entry. Your existing SRE tooling - Jaeger, Grafana, Datadog - picks up the spans automatically. The log entries feed into whatever aggregation pipeline you already run. No custom dashboards required.

The attributes captured per decision - node name, route label, confidence score, escalation flag, loop count - are the minimum viable observability set. Confidence score drift over time tells you when your input distribution has shifted. Escalation rate per node tells you which SLMs are failing most often. Loop count distribution tells you whether your loop controller is terminating too early or too late.
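These health signals are simple aggregations over the structured log records. A minimal sketch, assuming each record is a dict carrying the fields that instrument_slm_node logs:

```python
def summarize_decisions(records: list[dict]) -> dict:
    """Aggregate slm_decision log records into the three core health signals."""
    if not records:
        return {"mean_confidence": 0.0, "escalation_rate": 0.0, "max_loop_count": 0}
    confidences = [r.get("confidence", 0.0) for r in records]
    escalations = sum(1 for r in records if r.get("escalate"))
    loops = [r.get("loop_count", 0) for r in records]
    return {
        "mean_confidence": sum(confidences) / len(records),
        "escalation_rate": escalations / len(records),
        "max_loop_count": max(loops),
    }

# Hypothetical records pulled from your log aggregator for one node
window = [
    {"confidence": 0.9, "escalate": False, "loop_count": 1},
    {"confidence": 0.7, "escalate": True, "loop_count": 3},
]
summary = summarize_decisions(window)
```

Run this per node over a sliding window and alert on deltas against a baseline window - a falling mean confidence or rising escalation rate is your earliest distribution-shift signal.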

For LLM-specific observability - prompt/response pairs, token counts, and latency per node - Langfuse and LangSmith both integrate directly with LangGraph via callback handlers. The instrument_slm_node decorator above handles the SLM control layer; Langfuse handles the LLM reasoning layer. You want both - they cover different parts of the pipeline and have different failure signatures.


Testing SLM Patterns

You cannot ship a router to production and trust it based on manual testing. Each pattern needs a test suite that runs before deployment and after every fine-tuning cycle.

Think of this as decision contract testing - the same principle as contract testing in microservices, applied to SLM decision nodes. Each SLM node has an implicit contract: given this input, produce this decision with at least this confidence. The test suite makes that contract explicit and verifiable. When the contract breaks - a new model version, a distribution shift, a prompt change - the tests catch it before production does.

code
import pytest
from langchain_core.messages import HumanMessage

# Router test suite
ROUTER_TEST_CASES = [
    # (input, expected_route, min_confidence)
    ("What is my order status for #12345?",        "tool_call",            0.8),
    ("Explain the difference between TCP and UDP", "complex_reasoning",    0.75),
    ("What does refund mean?",                     "simple_lookup",        0.8),
    ("do the thing",                               "clarification_needed", 0.6),
    ("Book a meeting with Priya tomorrow at 3pm",  "tool_call",            0.8),
    ("Why is my payment failing?",                 "complex_reasoning",    0.7),
]

# Note: the parameter is named "user_input" rather than "request" because
# "request" is a reserved pytest fixture name and cannot be parametrized
@pytest.mark.parametrize("user_input,expected_route,min_confidence", ROUTER_TEST_CASES)
def test_router_classification(user_input, expected_route, min_confidence):
    state = AgentState(
        messages=[HumanMessage(content=user_input)],
        route="", confidence=0.0, intent={},
        loop_count=0, last_action_result="", escalate=False
    )
    result = slm_router(state)
    assert result["route"] == expected_route, (
        f"Expected {expected_route}, got {result['route']} "
        f"(confidence: {result['confidence']:.2f}) for: {user_input}"
    )
    assert result["confidence"] >= min_confidence, (
        f"Confidence {result['confidence']:.2f} below threshold {min_confidence} for: {user_input}"
    )

# Validator test suite
VALIDATOR_TEST_CASES = [
    # (intent_dict, should_pass)
    ({"intent": "create_event", "title": "Standup", "start_time": "2026-04-01T09:00:00"}, True),
    ({"intent": "create_event", "title": "", "start_time": "2026-04-01T09:00:00"}, False),    # Empty title
    ({"intent": "create_event", "title": "Standup", "start_time": "not-a-date"}, False),      # Bad date
    ({"intent": "create_event", "title": "Standup",
      "start_time": "2026-04-01T10:00:00", "end_time": "2026-04-01T09:00:00"}, False),        # End before start
]

@pytest.mark.parametrize("intent,should_pass", VALIDATOR_TEST_CASES)
def test_intent_validator(intent, should_pass):
    state = AgentState(
        messages=[], route="", confidence=0.9,
        intent=intent, loop_count=0, last_action_result="", escalate=False
    )
    result = validate_calendar_intent(state)
    if should_pass:
        assert not result["escalate"], f"Validation failed unexpectedly for: {intent}"
    else:
        assert result["escalate"], f"Validation passed unexpectedly for: {intent}"

Two test categories matter most. First, route coverage - every intent category the router is expected to handle should have at least 5 test cases, including edge cases and adversarial inputs. A router with 100% accuracy on your test set but 85% in production means your test set doesn't reflect your traffic distribution.

Second, validator invariants - every business rule the validator enforces should have a failing test case. If you can't write a failing case for a validation rule, the rule isn't testable and shouldn't be in production.

Run these before every deployment. Run them nightly against production traffic samples. The moment pass rate drops below your threshold, you have a distribution shift event.


Common Failure Modes

These patterns fail in predictable ways. Knowing the failure modes is as important as knowing the patterns.

The silent wrong route. The router assigns a route with high confidence, but the assignment is wrong. This happens most often when the user's phrasing is unusual or domain-specific in ways the router hasn't seen. You discover it through user complaints, not monitoring. Fix: add a feedback mechanism that lets users flag incorrect handling, and feed flagged cases back into your router evaluation set.

The validation loop. The schema enforcer extracts incomplete data. The validator flags it. The escalation handler asks the LLM to fill in the gaps. The LLM's response goes back through the schema enforcer. The schema enforcer extracts slightly different incomplete data. Repeat. Fix: track the number of extraction attempts per request. After two failures, return a structured error to the user rather than looping.
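The attempt cap can be sketched as a small guard function. The extraction_attempts counter is a hypothetical field added to the state for this purpose - it is not part of the base AgentState defined earlier:

```python
MAX_EXTRACTION_ATTEMPTS = 2

def guard_extraction_loop(state: dict) -> str:
    """Decide the next edge after a failed extraction: retry once, then bail out.

    Assumes a hypothetical 'extraction_attempts' counter carried in the state dict.
    """
    attempts = state.get("extraction_attempts", 0) + 1
    state["extraction_attempts"] = attempts
    if attempts >= MAX_EXTRACTION_ATTEMPTS:
        # Surface a structured error to the user instead of looping
        state["last_action_result"] = "extraction_failed"
        return "structured_error"
    return "retry_extraction"
```

Wire this as the condition function of a conditional edge after the validator, so the cap is enforced by graph topology rather than by convention.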

The consequence modeler bypass. A developer adds a new tool action and forgets to register it with the consequence modeler. The modeler has no classification for the action and either silently passes it through or crashes. Fix: make consequence modeling mandatory at the framework level - every tool call must produce a consequence classification before execution. Unknown actions get blocked by default, not passed through.
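A default-deny registry might look like the following sketch. The registry contents and profile fields are illustrative assumptions:

```python
# Hypothetical consequence registry - every tool action must be listed here
CONSEQUENCE_REGISTRY: dict[str, dict] = {
    "create_event": {"reversible": True,  "requires_approval": False},
    "delete_event": {"reversible": False, "requires_approval": True},
    "issue_refund": {"reversible": False, "requires_approval": True},
}

def classify_action(action: str) -> dict:
    """Look up an action's consequence profile. Unknown actions are blocked by default."""
    profile = CONSEQUENCE_REGISTRY.get(action)
    if profile is None:
        # Default-deny: an unregistered action never executes silently
        return {
            "safe_to_execute": False,
            "requires_approval": True,
            "reason": f"unregistered action: {action}",
        }
    return {
        "safe_to_execute": not profile["requires_approval"],
        "requires_approval": profile["requires_approval"],
        "reason": "registered",
    }
```

The point of the registry is the failure mode it removes: adding a new tool without classifying it produces a visible block at first use, not a silent pass-through.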

The confidence floor. The SLM learns to output high confidence scores regardless of actual certainty, because high-confidence decisions that route correctly are reinforced during fine-tuning while the cost of low-confidence escalations is invisible during training. The confidence score becomes meaningless. Fix: evaluate confidence calibration as a separate metric. The fraction of high-confidence decisions that were correct should be consistently above your threshold. If it's not, retrain with explicit calibration objectives.
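The calibration check reduces to one number: of the decisions made above your confidence floor, how many were actually right? A sketch, assuming you have logged (confidence, was_correct) pairs from production:

```python
def high_confidence_accuracy(decisions: list[tuple[float, bool]],
                             confidence_floor: float = 0.8) -> float:
    """Fraction of high-confidence decisions that were actually correct.

    Each decision is (confidence, was_correct). If this metric drops below
    your routing accuracy target, the confidence score is no longer calibrated.
    """
    high = [(c, ok) for c, ok in decisions if c >= confidence_floor]
    if not high:
        return 1.0  # no high-confidence decisions to judge
    return sum(1 for _, ok in high if ok) / len(high)
```

Track this alongside overall accuracy: overall accuracy can stay flat while high-confidence accuracy decays, which is exactly the "confidence floor" failure described above.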

The loop controller persuasion attack. An adversarial input tells the loop controller it has already completed the task and should return a specific result. The controller, being a language model, may comply. Fix: never trust the loop controller's assessment of task completion without validating against the actual state. The task is complete when the output validator passes and the result answers the original query - not when the controller says it's complete.
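A grounded completion check might look like this sketch. The function name and state fields are hypothetical, but the principle is the one stated above: the controller's own verdict never appears as an input:

```python
def task_complete(state: dict, validator_passed: bool) -> bool:
    """Ground task completion in observable state, not the controller's claim.

    Assumes hypothetical state fields 'messages' and 'last_action_result'
    set by the executing nodes, plus the output validator's pass/fail result.
    """
    has_result = bool(state.get("messages"))
    action_succeeded = state.get("last_action_result") in {"success", "llm_response", "complete"}
    # Deliberately no input derived from the loop controller's text output
    return validator_passed and has_result and action_succeeded
```

If the controller claims completion but this check fails, treat it as a REPLAN signal rather than returning the claimed result.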


What's Coming Next

This article covered the implementation patterns. The next article goes deeper into the fine-tuning layer:

  • Fine-Tuning SLMs for Production Roles - building training datasets for router and validator SLMs, running LoRA fine-tuning, and evaluating domain-specific SLMs against your actual workload distribution

The patterns above work with off-the-shelf SLMs. Fine-tuning gives you the next tier of reliability - a router that knows your specific intent taxonomy, a validator that understands your business rules natively, a loop controller trained on your domain's task completion patterns.

If you're building the orchestration layer that connects these patterns, the LangGraph series on this blog covers stateful multi-agent workflows, checkpoint persistence, and interrupt_before patterns for human-in-the-loop systems.

The patterns in this article are not AI patterns. They are software engineering patterns that happen to use AI at specific nodes. Build them that way.


A note on code: All examples use LangGraph 0.2.x and LangChain 0.3.x APIs. The Ollama integration requires Ollama running locally with the phi3:mini model pulled. For cloud-hosted SLMs, swap the Ollama instance for ChatOpenAI (or your provider's client) configured with the appropriate model string. Code is illustrative - production systems require additional error handling, retry logic, and environment-specific configuration.
