Your agent ran for 45 minutes, wrote 12 files, and shipped a PR. The tests pass. The code compiles. And something is subtly wrong - a security assumption that was violated three hours into the session, a context decision that poisoned the direction of the implementation, a tool call that silently returned an empty result and was treated as success.
You have the PR. You have the diff. You have no idea what the agent actually did to get there.
This is the default state of Claude Code without intentional observability: outcomes are visible, decisions are not. You can see what the agent produced. You cannot see why it made the choices it made, which context shaped those choices, which tool calls failed silently, or at what point the session diverged from what you expected.
For a coding agent running five-minute tasks, this is tolerable. For an agent running autonomous sessions for 45 minutes across worktrees, invoking subagents, firing hooks, and touching production-adjacent systems - it is a production liability. When something goes wrong and you cannot reconstruct what happened, you cannot fix the underlying cause. You can only hope it does not happen again.
This article builds the observability layer that makes Claude Code sessions reconstructable, debuggable, and improvable. Three layers: hooks-based audit logging (available today, zero infrastructure), OpenTelemetry metrics and traces (for teams and enterprise), and session-level debugging patterns that work regardless of which layer you have deployed.
Note: This article uses Claude Code as the reference implementation. The observability principles - structured event logging, per-step tracing, context-decision visibility, tool-call audit trails - apply to any agentic AI system. The specific mechanisms (hooks, OTel integration,
.claude/audit.jsonl) are Claude Code-specific. The patterns transfer directly to LangSmith for LangGraph, the OpenAI Agent SDK's tracing API, or any framework that exposes execution hooks.
Why Agent Observability Is Different From Application Observability
In a conventional web application, observability is tractable. Requests have fixed entry points, known execution paths, and deterministic outputs for the same inputs. When something goes wrong, the stack trace tells you where.
Agents do not have fixed execution paths. The same prompt produces different tool call sequences on different runs depending on context state, model sampling, and discovered information. There is no stack trace for "the model decided to read the wrong file because the context contained a misleading summary from three steps ago." The failure is not an exception - it is a decision. And decisions are only visible if you recorded them.
Three failure modes are specific to coding agents and invisible without instrumentation:
Context-driven misalignment - The agent made a correct decision given what it saw in its context, but what it saw was wrong. A compressed summary from step 3 contained an inaccuracy. The agent built on it for 30 more steps. The final output is wrong for reasons that are invisible in the output itself. You can only see this failure if you recorded the context state at each step.
Silent tool failure - A subagent returned an empty result because a grep found no matches in a misspelled directory path. The orchestrator treated the empty result as "no findings" and proceeded. The bug that should have been found was not found. There is no error. There is no exception. There is only a missing result that looks like a clean result. You can only detect this if you recorded what the tool call returned.
Drift detection - The session started correctly and diverged at step 14. The context compaction at step 12 dropped a constraint. Every decision after step 12 was made without that constraint. The final output violates it. You can only find step 12 if you have a step-by-step record of what was in the context and what was decided at each point.
These three failure modes require different observability instruments. Context-driven misalignment requires context state snapshots. Silent tool failure requires tool-call result logging. Drift detection requires a timeline that shows what changed between steps.
Layer 1: Hooks-Based Audit Logging
The fastest path to session visibility is a set of hooks that write structured events to a JSONL file. No infrastructure. No external services. Readable immediately from the filesystem. This is the baseline every Claude Code project should have before anything else.
Four hooks cover the complete session lifecycle. You built some of these in the Hooks article. Here they are assembled into a complete audit system.
The Audit Schema
Every event shares the same envelope. Consistent schema means you can query any event type with the same tools:
{ "ts": "2026-04-30T09:14:22.341Z", "session_id": "cc_abc123", "event": "tool_call", "hook": "PreToolUse", "tool": "Bash", "command": "grep -r 'JWT' src/auth/", "blocked": false, "reason": null, "duration_ms": null, "exit_code": null, "output_preview": null}The Four Audit Hooks
```bash
# .claude/hooks/audit.sh
# Universal audit hook - reads stdin, writes JSONL event
# Called by all four lifecycle hooks with different HOOK_EVENT values
set -euo pipefail

INPUT=$(cat)
HOOK_EVENT="${HOOK_EVENT:-unknown}"
AUDIT_FILE="${CLAUDE_PROJECT_DIR:-.}/.claude/audit.jsonl"
TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%S.%3NZ")

SESSION_ID=$(echo "$INPUT" | jq -r '.session_id // "unknown"')
TOOL=$(echo "$INPUT" | jq -r '.tool_name // ""')
COMMAND=$(echo "$INPUT" | jq -r '.tool_input.command // .tool_input.file_path // ""')
EXIT_CODE=$(echo "$INPUT" | jq -r '.tool_response.exit_code // ""')

# Truncate output to first 200 chars for preview - full content not needed in audit
OUTPUT_RAW=$(echo "$INPUT" | jq -r '.tool_response.stdout // .tool_response.content // ""')
OUTPUT_PREVIEW=$(echo "$OUTPUT_RAW" | head -c 200)

jq -nc \
  --arg ts "$TIMESTAMP" \
  --arg sid "$SESSION_ID" \
  --arg hook "$HOOK_EVENT" \
  --arg tool "$TOOL" \
  --arg cmd "$COMMAND" \
  --arg exit "$EXIT_CODE" \
  --arg preview "$OUTPUT_PREVIEW" \
  '{ ts: $ts, session_id: $sid, hook: $hook, tool: $tool, command: $cmd, exit_code: $exit, output_preview: $preview }' >> "$AUDIT_FILE"

exit 0
```

```json
{
  "hooks": {
    "SessionStart": [
      { "hooks": [ { "command": "HOOK_EVENT=SessionStart ./.claude/hooks/audit.sh", "async": true } ] }
    ],
    "PreToolUse": [
      { "matcher": ".*", "hooks": [ { "command": "HOOK_EVENT=PreToolUse ./.claude/hooks/audit.sh", "async": true } ] }
    ],
    "PostToolUse": [
      { "matcher": ".*", "hooks": [ { "command": "HOOK_EVENT=PostToolUse ./.claude/hooks/audit.sh", "async": true } ] }
    ],
    "Stop": [
      { "hooks": [ { "command": "HOOK_EVENT=Stop ./.claude/hooks/audit.sh", "async": true } ] }
    ]
  }
}
```

The `"async": true` flag is critical. Audit logging must not block tool execution. The hook writes the event in the background. The agent continues without waiting.
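One consequence of async, append-mode writes: downstream tooling should not assume every line of the log is intact, because an interrupted hook can leave a truncated final line. A minimal defensive loader sketch in Python - the required keys mirror the fields `audit.sh` writes, and the skip-on-parse-error behavior is a design choice, not part of Claude Code:

```python
import json

# Envelope fields written by audit.sh (assumption: keep in sync with the hook)
REQUIRED_KEYS = {"ts", "session_id", "hook", "tool",
                 "command", "exit_code", "output_preview"}

def load_audit_events(path):
    """Parse an audit.jsonl file, skipping malformed or truncated lines
    instead of crashing on them."""
    events = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # torn write from an interrupted hook - skip it
            if isinstance(event, dict) and REQUIRED_KEYS <= event.keys():
                events.append(event)
    return events
```

Every query script below can sit on top of a loader like this instead of trusting raw `jq` over a possibly torn file.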
Querying the Audit Log
A JSONL audit log is queryable immediately with jq:
```bash
# Full session timeline - what happened and when
cat .claude/audit.jsonl | jq -r '[.ts, .hook, .tool, .command] | @tsv' | column -t

# Every tool call that produced a non-zero exit code
cat .claude/audit.jsonl | jq 'select(.exit_code != "" and .exit_code != "0")'

# All bash commands run in this session - the ground truth of what changed
cat .claude/audit.jsonl | jq -r 'select(.tool == "Bash") | .command'

# Every file the agent read or wrote
cat .claude/audit.jsonl | jq -r 'select(.tool == "Read" or .tool == "Edit" or .tool == "Write") | .command'

# Session duration: first to last event
cat .claude/audit.jsonl | jq -r '.ts' | sort | { head -1; tail -1; }
```

This is the minimum viable audit trail. It answers: what did the agent do, in what order, and did anything fail? It does not tell you why the agent made each decision - for that you need the session transcript.
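When you want one aggregate view instead of running queries one by one, the same log can feed a small summarizer. A sketch, assuming the event fields written by `audit.sh`:

```python
import json
from collections import Counter

def summarize_session(path):
    """Tool-call counts plus every non-zero exit, from one audit.jsonl."""
    tools = Counter()
    failures = []
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("tool"):
                tools[event["tool"]] += 1
            # exit_code is logged as a string; "" means the tool had no exit code
            if event.get("exit_code") not in (None, "", "0"):
                failures.append((event.get("ts"), event.get("tool"),
                                 event.get("command")))
    return tools, failures
```

Run it after a session to see at a glance whether any tool call failed before you start reading diffs.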
Reading the Session Transcript
Claude Code writes a full session transcript to ~/.claude/projects/<project-hash>/. The transcript is a JSONL file where each line is a turn: the assistant message, the tool call, the tool result, the next assistant message. It is the complete record of what the model saw and what it decided at each step.
```bash
# Find the latest transcript for the current project
# Claude Code stores transcripts in ~/.claude/projects/<hash>/
# The hash is derived from your project path - find it with:
TRANSCRIPT=$(find ~/.claude/projects -name "*.jsonl" -newer ~/.claude/projects \
  2>/dev/null | sort -t'/' -k6 | tail -1)
# Or list all projects to find yours:
ls ~/.claude/projects/

# Show every assistant decision with its tool call
cat "$TRANSCRIPT" | jq 'select(.type == "assistant") | {
  turn: .turn_index,
  text_preview: (.content[] | select(.type == "text") | .text[:100]),
  tool_calls: [.content[] | select(.type == "tool_use") | {tool: .name, input: .input}]
}'

# Show what the model saw at each step (the full context is reconstructable
# by replaying turns in order - each turn's output becomes next turn's input)
cat "$TRANSCRIPT" | jq 'select(.type == "tool_result") | {
  tool_use_id: .tool_use_id,
  result_preview: (.content[:200])
}'
```

The transcript combined with the audit log gives you two complementary views: the audit log shows you what happened at the system level (which tools were called, what they returned, what was blocked). The transcript shows you what happened at the model level (what reasoning led to each tool call, what the model decided based on each result).
Layer 2: OpenTelemetry - Metrics, Events, and Traces
For teams running Claude Code at scale - multiple engineers, multiple worktrees, autonomous sessions running overnight - hooks-based logging is necessary but not sufficient. You need aggregated metrics, cross-session search, and distributed traces that show subagent relationships.
Claude Code ships native OpenTelemetry support. Enable it with a handful of environment variables:
```bash
# Enable telemetry
export CLAUDE_CODE_ENABLE_TELEMETRY=1

# Export metrics and events (choose your backend)
export OTEL_METRICS_EXPORTER=otlp
export OTEL_LOGS_EXPORTER=otlp

# Point to your OTel collector
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318

# Optional: enable distributed traces (beta)
export OTEL_TRACES_EXPORTER=otlp

# Optional: expose MCP and custom tool names in events
# Without this, MCP tool calls are collapsed to "mcp" in events
export OTEL_LOG_TOOL_DETAILS=1
```

This single configuration change makes Claude Code emit structured telemetry to any OTel-compatible backend - Honeycomb, Datadog, Grafana/Loki, Jaeger, or a self-hosted collector.
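A silently unset variable means a silently unobserved session, so it is worth verifying this configuration before launching anything long-running. A minimal preflight sketch - it checks only the variables listed above, and the expected values are taken from that export block:

```python
import os

# Expected telemetry configuration (mirrors the export block above)
REQUIRED = {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "OTEL_METRICS_EXPORTER": "otlp",
    "OTEL_LOGS_EXPORTER": "otlp",
}

def telemetry_preflight(env=None):
    """Return a list of misconfiguration messages; an empty list means go."""
    env = os.environ if env is None else env
    problems = []
    for key, expected in REQUIRED.items():
        if env.get(key) != expected:
            problems.append(f"{key}: expected {expected!r}, got {env.get(key)!r}")
    if not env.get("OTEL_EXPORTER_OTLP_ENDPOINT"):
        problems.append("OTEL_EXPORTER_OTLP_ENDPOINT is unset - events have nowhere to go")
    return problems
```

Wire it into whatever launcher starts your autonomous sessions and refuse to start when the list is non-empty.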
What Claude Code Emits
Metrics (time series, queryable as counters):
| Metric | What it tells you |
|---|---|
| `claude_code.cost` | Token spend per session, per user, per project |
| `claude_code.tokens` | Input/output token counts per API call |
| `claude_code.session` | Session count - autonomous vs interactive |
| `claude_code.lines_of_code` | LOC added/modified per session |
| `claude_code.pull_requests` | PRs opened per session |
| `claude_code.tool_decision` | Every tool call with allow/deny outcome |
Events (structured log entries, queryable as individual records):
| Event | What it tells you |
|---|---|
| `claude_code.user_prompt` | Every prompt submitted - the task the agent started with |
| `claude_code.tool_result` | Every tool execution with its outcome |
| `claude_code.api_request` | Every model call with token counts |
| `claude_code.api_error` | Every API failure with error type |
| `claude_code.hook_execution_start/complete` | Hook timing - where your safety gates are adding latency |
| `claude_code.skill_activated` | Which skills triggered on which tasks |
| `claude_code.compaction_event` | When and how context was compressed |
| `claude_code.subagent_stop` | When each subagent completed |
Traces (beta): Distributed spans that correlate the orchestrator's tool calls with each subagent's execution. A trace shows you: the parent session started, it invoked three subagents, here is the timeline of what each subagent did, here is where the latency came from.
The Three Dashboards Every Claude Code Team Needs
Dashboard 1: Session Health
Panels:

- Sessions per day (autonomous vs interactive)
- Average session duration by session type
- P95 tool call latency
- Error rate: `api_error` events / `api_request` events
- Context compaction events per session (proxy for session complexity)
- Hook execution time: PreToolUse hooks P95 (watch for latency creep)

Dashboard 2: Cost and Token Breakdown
Panels:

- Daily cost by user / by project / by model
- Input vs output token ratio (high output ratio = verbose responses or large diffs)
- Cost per PR opened (efficiency metric)
- Top-cost sessions (outlier detection)
- Subagent token overhead vs orchestrator: are subagents paying for themselves?

Dashboard 3: Tool Call Audit
Panels:

- Tool call distribution: which tools are most used?
- Deny rate by hook: which guards are firing most?
- Silent failures: `tool_result` events with empty output + exit_code 0
- File write frequency by directory (what is the agent touching most?)
- Bash command categories (grep/read vs write/execute vs git operations)

Detecting the Three Failure Modes With OTel
Context-driven misalignment - Correlate compaction_event with subsequent api_error or suboptimal output events. Sessions that compact heavily are sessions where context state changed significantly mid-execution. High compaction + downstream errors is the signal.
Silent tool failure - Query for tool_result events where output_preview is empty string and exit_code is 0. This is the pattern: tool ran, returned nothing, was treated as success. Filter further by tool type - empty Grep results are normal, empty Read results on an existing file are not.
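The same detection works against the Layer 1 audit log when no OTel backend is available. A minimal sketch over the `audit.sh` fields - `SUSPICIOUS_IF_EMPTY` is an assumption you tune per project, following the filtering rule above:

```python
import json

# Tools for which an empty result is suspicious. An assumption to tune:
# an empty Grep result is normal, an empty result from Read or Bash is not.
SUSPICIOUS_IF_EMPTY = {"Read", "Bash"}

def find_silent_failures(path):
    """Flag audit events that 'succeeded' (exit 0) but returned nothing."""
    hits = []
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            if (event.get("exit_code") == "0"
                    and event.get("output_preview", "") == ""
                    and event.get("tool") in SUSPICIOUS_IF_EMPTY):
                hits.append(event)
    return hits
```

Anything this returns is a tool call that looked like a clean result but carried no information - the exact pattern that poisons the 30 steps after it.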
Drift detection - Correlate compaction_event timing with the timestamp of the divergence you found in the session transcript. The compaction that dropped the constraint is the event just before the timeline of wrong decisions begins.
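To scope that correlation, one sketch: split the time-ordered event stream at the last compaction and inspect only what follows. The `event` field name here is an assumption about your export shape - adapt the key to your backend's schema:

```python
def events_after_last_compaction(events):
    """Split a time-ordered event list at the last compaction event.
    Everything returned was decided with the compressed context -
    this is the window to inspect for the dropped constraint."""
    cut = None
    for i, event in enumerate(events):
        if event.get("event") == "claude_code.compaction_event":
            cut = i  # keep scanning: we want the *last* compaction
    return events if cut is None else events[cut + 1:]
```

If the divergence point you found in the transcript falls inside the returned window, the compaction just before it is your prime suspect.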
Layer 3: Session Debugging Patterns
OTel and audit logs tell you what happened. Debugging tells you why it happened and what to change. Three patterns that work regardless of which instrumentation layer you have.
Pattern 1: Timeline Reconstruction
Before debugging anything, reconstruct the complete session timeline from available signals. Combine the audit log, the session transcript, and any hook output:
```bash
#!/bin/bash
# .claude/scripts/session-timeline.sh
# Reconstructs a readable timeline for a session ID
SESSION_ID="${1:-}"
AUDIT=".claude/audit.jsonl"
TRANSCRIPT=$(ls -t ~/.claude/projects/$(echo $PWD | md5sum | cut -d' ' -f1)/*.jsonl 2>/dev/null | head -1)

echo "=== SESSION TIMELINE: $SESSION_ID ==="
echo ""
echo "--- TOOL CALLS (from audit log) ---"
grep "$SESSION_ID" "$AUDIT" 2>/dev/null | \
  jq -r '[.ts, .hook, .tool, (.command[:60])] | @tsv' | \
  column -t

echo ""
echo "--- MODEL DECISIONS (from transcript) ---"
cat "$TRANSCRIPT" 2>/dev/null | \
  jq -r 'select(.type == "assistant") | "Turn \(.turn_index): " + (.content[] | select(.type == "text") | .text[:120])' 2>/dev/null

echo ""
echo "--- ERRORS AND BLOCKS ---"
grep "$SESSION_ID" "$AUDIT" 2>/dev/null | \
  jq 'select((.exit_code != "" and .exit_code != "0") or .blocked == true)'
```

The timeline is the first thing you look at for any unexplained session outcome. It shows you the sequence of decisions without having to read the raw transcript.
Pattern 2: The Divergence Bisect
When you know the output is wrong but the session was long, use bisection to find where it went wrong:
- Look at the final output. Identify the first wrong decision visible in the output (wrong file modified, wrong assumption made, wrong architectural choice).
- Find that decision in the session transcript by searching for the relevant tool call or assistant text.
- Read backward from that point in the transcript to find the last correct decision.
- The divergence is between those two points. Read that section of the transcript carefully - what information did the model have at each step?
- Cross-reference with the audit log for that timestamp range. Did any tool return empty or incorrect output during that window?
This is the agent equivalent of git bisect. You do not read the entire session - you find the boundary between correct and incorrect and read only the surrounding context.
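A small helper makes the backward read cheaper: pull only a window of assistant turns around a candidate divergence point instead of paging through the whole transcript. This sketch assumes the transcript shape described earlier (`type == "assistant"`, one JSON object per line); verify the fields against your Claude Code version:

```python
import json

def turn_window(transcript_path, center, radius=2):
    """Return the assistant turns at positions center-radius .. center+radius
    (by order of appearance in the transcript) - read only the boundary
    between correct and incorrect, not the whole session."""
    turns = []
    with open(transcript_path) as f:
        for line in f:
            message = json.loads(line)
            if message.get("type") == "assistant":
                turns.append(message)
    low = max(0, center - radius)
    return turns[low:center + radius + 1]
```

Bisect by calling it with a shrinking window: if the middle turn is still correct, move the center forward; if it is already wrong, move it back.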
Pattern 3: Counterfactual Testing
Once you have identified the divergence point and its cause, test your fix before the next full run:
```bash
# Replay a session from a specific point with modified context
# Claude Code does not have native replay - simulate by creating a new session
# with the context state that should have existed at the divergence point

# 1. Extract the context state at the divergence point from the transcript
# 2. Create a new Claude Code session with that context manually reconstructed
# 3. Run the same task from that point
# 4. Compare the outcome to what you expected

# The fix is validated when the counterfactual run produces the correct output
# The skill, hook, or CLAUDE.md change that prevents the divergence is your fix
```

Counterfactual testing is how skill entries, hook rules, and CLAUDE.md entries get validated before they are committed. You do not commit a fix and hope - you verify it produces the right outcome from the same starting conditions that produced the wrong one.
The Session Reconstruction Loop
Observability is not an end in itself. The return on instrumentation is the improvement loop it enables. I call this the Session Reconstruction Loop: the cycle where every unexpected session outcome becomes a diagnosed root cause, which becomes a targeted fix in the right layer, which prevents the same failure from recurring.
```mermaid
flowchart LR
    A[Session runs]:::blue --> B[Audit log + OTel\nevents captured]:::teal
    B --> C{Outcome\nas expected?}:::purple
    C -->|Yes| D[Benchmark:\nbaseline this session\nfor regression testing]:::green
    C -->|No| E[Timeline reconstruction:\nfind divergence point]:::yellow
    E --> F{Root cause?}:::purple
    F -->|Context decision| G[Update CLAUDE.md\nor compress strategy]:::blue
    F -->|Silent tool failure| H[Add PostToolUse\nvalidation hook]:::blue
    F -->|Policy violation| I[Add PreToolUse\nblocking hook]:::blue
    F -->|Missing expertise| J[Write or update\na Skill]:::blue
    G --> K[Counterfactual test:\nverify fix from same start state]:::teal
    H --> K
    I --> K
    J --> K
    K --> A
    classDef blue fill:#4A90E2,color:#fff,stroke:#3A7BC8
    classDef purple fill:#7B68EE,color:#fff,stroke:#6858DE
    classDef teal fill:#98D8C8,color:#fff,stroke:#88C8B8
    classDef yellow fill:#FFD93D,color:#333,stroke:#EFC92D
    classDef green fill:#6BCF7F,color:#fff,stroke:#5BBF6F
```
Every session that produces an unexpected outcome is an input to one of the four fix mechanisms from the rest of this series. The root cause classification determines which layer gets updated:
- Context decision → `CLAUDE.md` or compression strategy (Context Engineering article)
- Silent tool failure → `PostToolUse` validation hook (Hooks article)
- Policy violation → `PreToolUse` blocking hook (Hooks article)
- Missing expertise → Skill definition (Agent Skills article)
Without observability you cannot classify the root cause. You can only observe that something went wrong and try different things. With observability, every failure is diagnostic. The loop closes.
The Wrong Way: Flying Blind on Autonomous Sessions
Here is the pattern that produces the most expensive debugging sessions. An engineer enables Auto mode, walks away for 45 minutes, returns to find 30 commits, 15 PRs, and something wrong in the authentication layer:
```bash
# The engineer's options without observability:
# 1. Read 30 git diffs and try to reconstruct the agent's reasoning from its outputs
# 2. Ask Claude to explain what it did (it will hallucinate - it does not remember)
# 3. Roll back everything and start over with a better prompt
# 4. Ship it and hope the tests caught everything

# None of these is debugging. All of them are archaeology.
```

The right way: enable observability before any Auto mode session runs:
```bash
# Before enabling Auto mode on any session longer than 15 minutes:
# 1. Confirm audit.sh hook is installed and writing to .claude/audit.jsonl
# 2. Confirm CLAUDE_CODE_ENABLE_TELEMETRY=1 is set (for team sessions)
# 3. Set a Stop hook that validates expected artifacts exist
# 4. Set your maximum session token budget if your team admin has configured it

# Then run Auto mode.
# If something goes wrong, the audit log is your first read.
# The session transcript is your second read.
# The OTel dashboard is your third read.
```

An autonomous session without observability is a black box with consequences. You learn what it did from the damage, not from the record.
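Item 3 in that checklist - a Stop hook that validates expected artifacts - can be as small as the sketch below. The `EXPECTED` paths are hypothetical placeholders, and how you surface a non-empty result from the Stop hook depends on your hooks configuration and your Claude Code version's exit-code semantics:

```python
import os

# Hypothetical artifact list - replace with what your session must produce
EXPECTED = ["reports/findings.md", ".claude/audit.jsonl"]

def missing_artifacts(expected=EXPECTED):
    """Return every expected artifact the session failed to produce.
    An empty list means the session delivered what it was asked for;
    a non-empty list is what your Stop hook wrapper should report."""
    return [p for p in expected if not os.path.exists(p)]
```

This turns "walk away for 45 minutes" into a session that cannot end silently without its deliverables.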
Production Checklist: Observability Before You Run Autonomous Sessions
Hooks-based audit (every project)
- `.claude/hooks/audit.sh` installed and writing to `.claude/audit.jsonl`
- All four lifecycle events covered: `SessionStart`, `PreToolUse`, `PostToolUse`, `Stop`
- All hooks marked `"async": true` so they do not block tool execution
- Audit log is append-only and never truncated mid-session
- `audit.jsonl` is in `.gitignore` (it contains full command history including sensitive paths)
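"Append-only and never truncated mid-session" still needs a growth valve: rotate the log between sessions, for example from a `SessionStart` hook, before the new session starts writing. A minimal sketch - the 5 MB threshold is an arbitrary assumption:

```python
import os
import time

def rotate_audit_log(path=".claude/audit.jsonl", max_bytes=5_000_000):
    """Archive the audit log if it has grown past max_bytes.
    Call this only between sessions (e.g. from a SessionStart hook) -
    never mid-session, so an in-flight trail is never truncated."""
    if os.path.exists(path) and os.path.getsize(path) > max_bytes:
        archived = f"{path}.{time.strftime('%Y%m%d%H%M%S')}"
        os.rename(path, archived)  # old events preserved under a new name
        return archived
    return None
```

Archived files stay queryable with the same `jq` one-liners, so rotation costs you nothing but a glob.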
Session transcript access
- You know where Claude Code writes session transcripts for your project
- You have a script or alias to find and read the latest transcript
- You can run the timeline reconstruction script for any session ID
OpenTelemetry (team and enterprise)
- `CLAUDE_CODE_ENABLE_TELEMETRY=1` set in team environment
- Session Health dashboard deployed and alerting on error rate threshold
- Cost dashboard showing per-user and per-project breakdown
- Tool Call Audit dashboard showing deny rate and silent failure pattern
Debugging practice
- Before any autonomous session over 30 minutes: confirm audit hook is running
- After any unexpected outcome: run timeline reconstruction before reading diffs
- After any fix: counterfactual test from same starting conditions before committing
- Every silent tool failure added to the `PostToolUse` validation checklist
References
- Anthropic / Claude Code Docs. (2026). Monitoring - OpenTelemetry configuration and available metrics. https://code.claude.com/docs/en/monitoring-usage
- Anthropic / Claude Code Docs. (2026). Hooks reference - lifecycle events and exit code semantics. https://code.claude.com/docs/en/hooks
- Anthropic / Claude API Docs. (2026). Managed Agents session tracing - Console observability. https://platform.claude.com/docs/en/managed-agents/observability
- Huang, K. (April 2026). Chapter 9: Observability and Debugging (Claude Code vs Hermes Agent). https://kenhuangus.substack.com/p/chapter-9-observability-and-debugging
- Huang, K. (April 2026). Claude Code Harness Pattern 9: Observability and Debugging. https://kenhuangus.substack.com/p/claude-code-harness-pattern-9-observability
- simple10. (2026). agents-observe: Real-time observability dashboard for Claude Code agents. GitHub. https://github.com/simple10/agents-observe
- Melodic Software. (2025). claude-code-observability: Monitor Claude Code interactions with observability hooks that log 14 events. Claude Plugin Hub. https://www.claudepluginhub.com/plugins/melodic-software-claude-code-observability-plugins-claude-code-observability
- Doneyli. (January 2026). I Built my own Observability for Claude Code - Here's Why and How. Signal over Noise. https://doneyli.substack.com/p/i-built-my-own-observability-for
- Nylas CLI. (2026). AI Agent Audit Logs from the CLI. https://cli.nylas.com/guides/audit-ai-agent-activity
- OpenTelemetry Project. OpenTelemetry Specification - Logs, Metrics, Traces. https://opentelemetry.io/docs/specs/otel/
Related Articles
- Which Claude Code Layer Solves Your Problem? A Diagnostic Guide for AI Engineers
- Hooks: The Enforcement Layer That Turns Agent Policy Into Agent Fact
- Agent Skills Are Not Prompts. They Are Production Knowledge Infrastructure.
- Unified Observability Across Agent Fleets: Building the Control Plane Metric Layer