You Can't Unit Test an LLM Pipeline. Try Chaos Engineering Instead.
Multi-turn AI orchestration fails in ways that no assertion will ever catch. The fix is not better metrics, but chaos engineering for LLM pipelines.
Multi-turn AI orchestration fails in ways that no assertion will ever catch. The fix isn't better metrics, it's chaos engineering for LLM pipelines.
At some point, if you build anything serious with LLMs, you hit a wall.
Not the "model returned garbage" wall, that one's easy. You add a retry, tweak the prompt, move on. The wall I mean is subtler: your pipeline looks like it's working. The final output is correct. But somewhere in the middle, across turns 2 and 3, it created duplicates, deleted them, re-created a slightly different version, and converged on the right answer through sheer luck and self-correction.
You only find out because you read a debug log at 11pm.
The problem is that your testing instincts are wrong for this domain. Unit tests check deterministic functions. LLM pipelines are stateful, non-deterministic, and multi-turn. The bug isn't in any single call, it's in the interaction between calls, across a conversation that evolves in ways you didn't predict.
The telephone game
Think of your pipeline as a chain: stage A passes context to stage B, which passes context to stage C. Every handoff is a chance to lose a little meaning. A small imprecision at stage A arrives at C fully degraded, and by then, the final output might still look fine on the surface.
You don't notice because you're inspecting endpoints. The loss happened in the middle, between two calls that both "succeeded" in isolation.
To see the whole chain, you need tooling that records every handoff, not just the last one.
"Did you look at DeepEval?"
Yes. And LangWatch, and Arize, and Langfuse, and LangSmith. They're all real tools built by real teams that solve real problems. If you're shipping a solid, generic chatbot or a RAG pipeline, you should probably just use one of them.
But after spending a few hours with each, something kept bothering me. They're all structured around the same question:
"Given an input, does my pipeline produce the expected output?"
That's regression testing. It's valuable. It protects you from bugs you already know about, on scenarios you've already seen.
The question I actually had was different:
"How does my pipeline behave when someone interacts with it in a way I would never have thought to put in a test case?"
That's not a gap in any specific tool. It's a category gap. No amount of extra metrics bolted onto a regression framework will answer that question, because the question isn't about better scoring, it's about generating situations that don't exist yet.
Regression vs. emergent exploration
Two ways to test software. Both valuable. They're not the same.
Regression testing is quality control on a production line. You have a golden set of inputs and expected outputs. On every change, you rerun them. Same ten checks on every piece. You know exactly what you're looking for. If something regresses, you catch it. If something new breaks in a way you didn't anticipate, you don't.
Emergent exploration is crash-testing cars. You put different dummies (personas) in the driver's seat, hit the wall at different angles (seeds), and watch what breaks that you hadn't predicted. The point isn't to verify a known behavior, it's to surface unknowns.
Multi-turn LLM pipelines fail the second way. The bugs live in conversation patterns that no one wrote down, because no one knew they existed.
Here's what emergent exploration actually finds. This is a real run from a session I ran two nights ago, against a pipeline that turns vague product ideas into structured requirements. Persona: junior-dev, delegates every decision, never technical, impatient. Seed: "I need an API backend for my mobile app."
The simulator drove the conversation for 30 turns. I picked four:
Turn 3. User: "Dunno, do the usual stuff, someone uploads a photo of the part and others like it or comment." Assistant: "'The usual stuff' doesn't exist in software. If we don't define the scope, we'll end up with a useless toy."
Turn 14. User: "I don't know what that means, do whatever is best as long as it works well." Assistant: "Saying 'as long as it works well' is an empty phrase. We have to set the limits of the system."
Turn 26. Assistant: "I've defined almost everything myself because you had no ideas."
Turn 30. User: "Dunno what architecture means, do whatever, as long as it connects to Flutter." Assistant: "Since you're using Flutter and delegating everything to me, I've picked a modern stack: Node.js with TypeScript on AWS and Firebase."
Read that arc. The pipeline starts by asking for clarification. By turn 14 it's lecturing. By turn 26 it's openly scolding. By turn 30 it's unilaterally picking the tech stack a user never asked for.
I did not write that test. I would not have thought to. "User keeps delegating for 30 turns" is not a test case you put in a golden set, it's a behavior that only exists when something reacts, turn by turn, to what the pipeline just said. The pipeline only shifted from asking to inventing to lecturing because the simulator kept being vague. And kept being vague. And kept being vague.
A static dataset can't produce that. A replayed trace can't produce that. You need something that drives the conversation.
Same pipeline, different persona
Now run the same pipeline against a different persona, senior-developer-cazzutissimo: technical, decisive, knows exactly what they want. Seed: "I want to add a login page to the app, just a mockup."
Turn 1. User: "It needs to be interactive with credentials hard-coded so I can try it right away." Assistant: "Good, let's go from theory to practice. An interactive login needs a destination."
Turn 8. User: "Write it in React and Tailwind, that way it's clean and quicker for me to integrate." Assistant: "React and Tailwind work, but an 'interactive' mockup has technical limits we need to pin down."
Turn 14. User: "Keep all the CSS inside Tailwind classes, no external files. When you're done, just hand me the full code." Assistant: "Almost there. I just need to define how to handle data and internal state to make sure the single file stays clean as you asked."
No lecturing. No scolding. The pipeline converges in 17 turns and exits cleanly. Same code, same prompt, same model, different conversational pattern. Here's what the metrics show side by side:
senior-dev / login mockup | junior-dev / API backend | |
|---|---|---|
| Outcome | converged (signal) | hit turn cap (30) |
| Turns used | 17 | 30 |
| Total tokens | 80,858 | 188,189 |
| Session duration | 169s | 423s |
| Tone by end | concise, collaborative | passive-aggressive |
Same pipeline. 2.4 times the tokens, 2.5 times the runtime, plus the quality cliff. Entirely a function of how the user talked to it. No golden set would ever surface that, because no golden set varies the conversational pattern of the user.
This is the kind of finding that changes how you think about cost: your pipeline doesn't have a cost, it has a cost surface conditional on persona and seed. The harness lets you map that surface.
The telephone chain, made visible
What you get out of the harness is a view, not a score.
Your pipeline is a chain of stages, each producing an artifact: a structured piece of output that the next stage consumes. Call it a spec, a plan, a summary, a set of decisions, whatever fits your domain. The harness threads those artifacts from one stage to the next and records every handoff: input received, output produced, state before, state after. Stack those handoffs vertically and you get the Telephone Chain. You can see exactly where the chain started to degrade, not just that the final output was off.
Personas. Not biographies, but response policies: how decisive, how vague, at what turn contradictions start, whether instructions get ignored. You sweep policies, seeds, and models as a cartesian product. Declare 6 personas, 7 seeds, 2 models, and you get 84 reproducible runs from one CLI command. No hand-written test cases, no branching golden sets.
The artifact of every run is a single JSON. Diffable, grepable, yours. No SaaS, no export lock-in, no one else's backend. The dashboard is a consumer of that JSON: build your own views, or jq it on the command line.
What the harness needs, concretely
1. A persona dataset. Not biographies. Policies. A persona describes how someone responds, not who they are. A "junior PM who delegates half the time and contradicts themselves around turn 4" is a policy. "Alice, 32, works at a fintech startup" is useless. You can't simulate it.
The policies that matter are behavioral: how decisive is this person? How vague? Do they push back or accept whatever the pipeline says? Do they answer the question asked, or something adjacent? Cover at least five or six distinct profiles: always-first-option, always-vague, delegates-half-the-time, contradicts-around-turn-N, verbose-ignores-structure. Each one stresses a different failure mode.
2. A user simulator. A second LLM that reads the policy and plays the user turn by turn. The catch: you'll use a capable model as your simulator, and capable models are trained to be cooperative, structured, and clear. They make terrible human simulators. They answer every question completely. They never say "I don't know." They never contradict themselves. You end up testing against an idealized user who doesn't exist.
The simulator prompt has to explicitly fight this:
You are NOT a helpful assistant. You are a human user. Humans are often vague. They delegate decisions they don't care about. They sometimes contradict what they said two turns ago. Your behavior policy this session is [policy]. Do NOT give clean, structured answers unless this persona would naturally do that. Do not be artificially cooperative.
Without this counter-prompt, the harness produces unrealistically smooth sessions. Every bug you'd find in production stays hidden, because your fake user is too polite to trigger it.
3. A runner. Orchestrates the turns. Sends the user message into your pipeline, gets a response, sends that to the simulator, gets the next user message, repeats. Captures a full snapshot at every turn: every LLM call with inputs and outputs, state before and after each mutation, latency per call. This is your raw material.
One constraint that matters more than it sounds: the runner should call the same code you run in production. If you rewrite your pipeline as a standalone test harness (even "just for testing"), you're measuring a copy, and that copy diverges from prod on the first commit. Extract your pipeline into a framework-agnostic package; both prod and the harness import from it. There's no shadow implementation.
Cap the conversation at a reasonable turn count based on context. If it doesn't converge within the cap, that itself is a signal.
4. A judge. An LLM that reads the full transcript and evaluates process quality. The question isn't "was the final answer right" (often there's no ground truth for that) but did the pipeline behave well across the session? Did it ask about something the user already answered? Are the extracted items atomic, or bloated? When the user delegated a decision, did the pipeline make the call and move on, or keep asking? Are there contradictions that were never caught?
Validate the validator
Here's a failure mode that's easy to miss: you run the suite, you get numbers, the numbers look reasonable, you ship. A month later you find the judge was never catching regressions at all. It was giving sensible-looking scores to broken sessions because its rubric was too generic.
Before you trust any number the suite produces, prove the suite fails when the pipeline fails. Pick a bug you know exists (or plant a synthetic regression). Run the suite against the broken version. The judge's score should drop by at least 15% of its range on the sessions where the bug is visible. If it doesn't, either your judge prompt is too vague and needs sharpening, or the broken behavior isn't surfaced in the transcript at all, which means it's the pipeline's stage outputs that need fixing, not the judge.
One subtlety: measure on the exposed subset, not the batch average. If your bug only affects 3 of 12 seeds, averaging across all 12 dilutes the signal below the 15% threshold. Identify the most-exposed seeds before running the broken version, not after.
A judge that doesn't drop when your pipeline breaks isn't a judge. It's a broken thermometer that always reads 37 degrees.
Split your metrics in two
Different metrics have different costs. Don't treat them the same.
Programmatic metrics compute directly from the transcript. No LLM, deterministic, cheap. Things like turns to complete, ephemeral items (created and then deleted in the same session: pure waste, and a very readable signal), churn by turn, re-asks on topics the user already answered. Run these on every change. They automate in CI.
Judge metrics are what a script can't compute. Clarity scores, delegation handling, unflagged contradictions, process coherence. LLM-powered, more expensive, stochastic. Run them on a targeted subset, or when you're diagnosing a specific regression. Cache aggressively.
Why this is worth running
What you get isn't a number, it's a direction. Tweak the extraction prompt, rerun against junior-dev, watch whether the cost delta shrinks or grows. Change the judge rubric, rerun, see whether the 15% drop on exposed seeds appears. A metric in isolation is a data point; a metric across runs is a direction.
LLM pipelines degrade in ways that are easy to rationalize: "the output is still correct, it just took an extra turn." That's the lie pipelines tell you. The harness measures the cost of the lie.
Chaos engineering for LLM pipelines
What I've described has a name in adjacent fields: chaos engineering. You don't find race conditions in distributed systems with unit tests. You find them by injecting controlled disruption and watching what breaks. Race conditions aren't in any single function, they're in the interaction between functions running concurrently. Multi-turn LLM pipelines fail the same way: not in the call you traced, but in the interaction you never considered.
A simulation harness is chaos engineering for LLM pipelines. Personas are the chaos monkeys. The simulator is the loop that sustains unpredictable behavior. The judge is the observation layer. The whole point is to find failure modes before users find them.
I ended up building this as an open-source tool called daglens. It includes the simulator, the dashboard, and a Telephone Chain view that renders exactly the kind of stage-by-stage degradation described above. The transcript excerpts in this article are real runs from it.
I'm releasing it publicly in the next few days. Part 2 will walk through the architecture: same code in prod and harness, loop ownership in practice, and how the cartesian product of personas, seeds, and models turns 84 experiments into a single command.