You’ve written a prompt. It works beautifully. You ship it to production.
Three days later, someone reports wildly different answers to identical questions. You run the exact same input and get a different result than yesterday. Your test suite passes locally, fails in CI, passes again on re-run.
Welcome back to non-determinism in Large Language Models.
In my previous post, I walked through the six layers where randomness creeps into LLM systems. If you haven’t read it, the short version: LLMs are probabilistic systems, and that’s not a bug. It’s their nature.
This raises the obvious question: what do we actually do about it?
Here’s the thing: the answer depends entirely on what you’re trying to achieve. Non-determinism isn’t inherently bad. It’s the raw material for creativity, brainstorming, exploration. The goal isn’t deterministic LLMs—it’s controlled stochasticity. Randomness where you want it, predictability where you need it.
Think of it like conducting an orchestra. You don’t silence the musicians. You don’t make them all play the same note. You control when they play freely and when they follow the score exactly. Trying to eliminate all variance is asking for a symphony with one instrument playing one note. Letting variance run wild is cacophony.
What you want is a conductor’s control: creativity in the solo passages, precision in the ensemble sections.
Let me walk you through what actually works.
Temperature = 0 Doesn’t Mean What You Think It Does #
Every tutorial mentions temperature and top-p. Lower temperature means lower entropy, right? Set it to zero, problem solved?
Not quite.
Temperature controls how “peaked” or “flat” the probability distribution is over candidate tokens. But temperature = 0 doesn’t equal true determinism. Vincent Schmalbach found [1] that greedy decoding still produces variance due to floating-point arithmetic and GPU parallelism. Michael Brenndoerfer ran identical prompts [2] on different GPU types and watched token probabilities shift just enough to cause divergence after a few words.
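To make the "peaked vs. flat" intuition concrete, here is a minimal sketch of temperature-scaled softmax over hypothetical logits for three candidate tokens (the logit values are made up for illustration):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to token probabilities, scaled by temperature."""
    if temperature == 0:
        # Greedy decoding: all probability mass on the argmax token.
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.5, 0.5]
print(softmax_with_temperature(logits, 1.0))  # flatter: top token ~0.55
print(softmax_with_temperature(logits, 0.2))  # peaked: top token ~0.92
print(softmax_with_temperature(logits, 0))    # greedy: [1.0, 0.0, 0.0]
```

Note what this sketch cannot show: even the `temperature == 0` branch assumes the logits themselves are identical across runs, which is exactly the assumption floating-point arithmetic and GPU parallelism break.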
Random seeds help reproducibility when available, but they’re fragile. OpenAI’s documentation admits [3] results are only “mostly” deterministic even with fixed seeds. Seeds break with tool calls, streaming, context changes, and model updates.
Here’s the critical insight I wish someone had told me six months ago: sampling controls reduce variance in token selection, but they don’t constrain reasoning paths.
The model might express the same conclusion ten different ways while always selecting the most probable next token. If your goal is testable, reproducible outputs, sampling parameters are necessary but insufficient.
You’re tuning the instruments. But you’re not conducting the orchestra.
The Techniques That Actually Constrain Output #
The most effective approaches work at a different level entirely. They constrain the output space rather than tweaking probabilities.
Structured Outputs: Collapsing the Possibility Space #
JSON schemas, grammars, and typed outputs fundamentally change the game. When you enforce structure, you collapse the space of valid generations. The model has fewer choices about phrasing, ordering, formatting—and those were exactly the choices introducing the most variance.
Humanloop reported [4] going from 35.9% reliability with prompt engineering alone to 100% reliable JSON schema conformance with OpenAI’s strict mode. Not 100% accurate content—the schema ensures the container is correct, not the facts inside it—but 100% consistent structure.
How does this work? Constrained decoding modifies the model’s logits in real time to remove tokens that would violate the structure. The model can only generate valid continuations.
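A toy sketch of that masking step, with a made-up four-token vocabulary and a hypothetical `is_valid` callback standing in for a real grammar engine:

```python
import math

def constrained_greedy_step(logits, vocab, is_valid):
    """Pick the highest-logit token among those the grammar allows.

    `is_valid` is a hypothetical callback that checks whether a token
    keeps the output schema-conformant at this position.
    """
    masked = [
        (l if is_valid(tok) else -math.inf)  # forbidden tokens get -inf
        for l, tok in zip(logits, vocab)
    ]
    best = max(range(len(masked)), key=lambda i: masked[i])
    return vocab[best]

# Toy vocabulary: the model "wants" to emit prose, but only JSON
# punctuation and digits are valid at this point in the schema.
vocab = ['Sure', '{', 'the', '7']
logits = [3.1, 1.2, 2.8, 0.4]
allowed = {'{', '7'}
token = constrained_greedy_step(logits, vocab, lambda t: t in allowed)
print(token)  # '{' — the highest-scoring *valid* token wins
```

Real implementations apply this mask over the full vocabulary at every decoding step, but the principle is the same: invalid continuations never get a chance to be sampled.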
The trade-off is reduced expressiveness. For production use cases—API integrations, database updates, automated pipelines—that’s exactly the right trade.
You’re giving the model a very specific piece of sheet music to play.
Skills and Tools: When Code Should Do What Tokens Shouldn’t #
This is where spec-kit [5], Claude Code skills [6], and similar approaches shine.
A skill is a deterministic function invoked by the model instead of free-form generation. Calculations. Validation. Formatting. Business rules. Database queries. Anything that has a correct answer should be offloaded to code rather than generated probabilistically.
I learned this the hard way. I once watched a system “calculate” tax rates by having the LLM reason through percentages in natural language. Different runs produced different rounding. Different rounding produced different final amounts. The tax authority was not amused.
Here’s what actually works: you invoke a skill with a deterministic tax calculation function. Same input, same output, every single time.
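A minimal sketch of what such a skill might look like (the function name, the cents-based interface, and the 19% rate are illustrative assumptions, not the system I debugged):

```python
from decimal import Decimal, ROUND_HALF_UP

def calculate_tax(amount_cents: int, rate: str) -> int:
    """Deterministic tax calculation: integer cents in, integer cents out.

    Decimal with an explicit rounding mode removes exactly the
    rounding drift the LLM was 'reasoning' its way into.
    """
    tax = (Decimal(amount_cents) * Decimal(rate)).quantize(
        Decimal('1'), rounding=ROUND_HALF_UP
    )
    return int(tax)

# Same input, same output, every single time.
print(calculate_tax(19_999, '0.19'))  # 3800 cents of tax on 199.99 at 19%
```

The model's job shrinks to extracting `amount_cents` and `rate` from the conversation and invoking the skill; the arithmetic never touches a sampled token.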
The anti-pattern - and I’ve seen this everywhere - is letting the model simulate a skill in text instead of calling it. Systems where the LLM role-plays doing math rather than invoking a calculator. The results aren’t calculations. They’re creative fiction about calculations.
Notice the pattern? Anywhere you need correctness, you need code. Anywhere you need understanding, you need the model. Don’t make the model pretend to be a computer. You have actual computers for that.
Specification-Driven Development: Making Requirements Executable #
GitHub’s spec-kit [5] takes this further by making specifications themselves executable. Instead of writing code that sort-of matches fuzzy requirements, you write detailed specifications that drive implementation directly.
The seven-step process creates a paper trail of decisions. Each step constrains the next. By implementation, the space of valid outputs is narrowly defined.
Han Chung Lee’s deep dive [7] shows how this scales: “When requirements are ambiguous, implementations vary wildly. When requirements are precise, implementations converge.”
You were going to spend the time anyway—either up front in specification or later in debugging variance. I’d rather spend it once.
Writing Prompts That Reduce Variance #
Not everyone can adopt new tooling. So what reduces variance through prompt engineering alone?
Eliminate underspecification. Replace “analyze this data” with concrete steps: “Calculate the mean, median, and standard deviation. Compare to last month’s values. Flag any metrics that changed by more than 10%.” Replace “be brief” with “2-3 sentences, maximum 50 words.”
Every ambiguous instruction is a branch point where different runs diverge.
Single-objective prompts. “Be concise, thorough, creative, and safe” is four conflicting goals. The model balances them differently each time. Prefer: “Optimize for correctness. Do not optimize for creativity.”
Explicit output examples. Few-shot examples anchor the output space more effectively than descriptive instructions. Research [8] consistently shows that concrete examples beat abstract specifications.
But here’s where it gets worse: even perfect prompts can’t overcome architectural chaos.
The Architecture That Kills You: Agents Without Constraints #
Agentic systems multiply non-determinism. Every decision point compounds variance. And if you’re allowing self-critique loops or re-planning? The variance compounds exponentially.
Think about what happens. The agent plans. Executes step one. Reflects. Revises the plan. Executes the revised step two. By step four, you’re executing some mutated descendant of the original plan, and that descendant differs every run.
Here’s the architectural pattern that actually works: separate planning from execution.
Planning phase: higher temperature. Let the model explore options, consider alternatives, propose approaches. This is where creativity lives.
Execution phase: low temperature, constrained tools, locked plan. The model follows the plan without changing it mid-stream. This is where reliability lives.
This isolates non-determinism to the planning stage. Execution becomes reproducible. When something goes wrong, you can diff plans rather than chasing variance.
Practical implementation: cap retry attempts at three. Disable self-critique loops in production—they’re useful for exploration but murder reproducibility. Log and replay tool traces.
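The execution half of the pattern can be sketched in a few lines. Everything here is an illustrative assumption—the plan format, the tool whitelist, the retry cap—but it shows the key move: the plan is serialized once and never mutated mid-run.

```python
import json

MAX_RETRIES = 3

def execute_locked_plan(plan_steps, tools):
    """Execute a locked plan step by step with bounded retries.

    `plan_steps` would come from a high-temperature planning call;
    here it is a plain list so the execution phase stays inspectable.
    """
    locked_plan = json.dumps(plan_steps)  # lock it: diffable, replayable
    results = []
    for step in json.loads(locked_plan):
        tool = tools[step['tool']]        # only whitelisted tools
        for attempt in range(MAX_RETRIES):
            try:
                results.append(tool(**step['args']))
                break
            except Exception:
                if attempt == MAX_RETRIES - 1:
                    raise  # no self-critique loop, no silent re-planning
    return results

# Hypothetical whitelisted tools.
tools = {'add': lambda a, b: a + b, 'upper': lambda s: s.upper()}
plan = [
    {'tool': 'add', 'args': {'a': 2, 'b': 3}},
    {'tool': 'upper', 'args': {'s': 'ok'}},
]
print(execute_locked_plan(plan, tools))  # [5, 'OK']
```

Because `locked_plan` is a plain JSON string, two divergent runs can be compared by diffing their plans—which is exactly the debugging move the pattern buys you.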
The plan-then-execute pattern also limits blast radius for prompt injection. Once the plan is locked, injected content can influence execution details but cannot add unauthorized tool calls.
You get security benefits alongside reproducibility benefits. Rare in this space.
Retrieval Variance: The Hidden Killer in RAG Systems #
If you’re building RAG systems, you have another layer of variance to manage: what gets retrieved and in what order.
Chunk ranking ties are more common than you’d think. Multiple documents with near-identical similarity scores get returned in different orders. Dynamic corpora change between queries. Embedding model updates shift the entire vector space. Many similarity search implementations are themselves non-deterministic due to parallelization and approximate nearest neighbor algorithms.
I once debugged a RAG system where identical queries returned documents in different orders 40% of the time. The model saw different context. Different context produced different answers. The team thought they had a prompt problem. They had a retrieval problem.
What actually works: fixed document versions with explicit version pinning. Deterministic ranking cutoffs—if scores are tied, use secondary sort keys like document ID. Stable chunk IDs that don’t change when content is re-indexed. Explicit citation requirements that force the model to reference specific retrieved chunks.
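The secondary-sort-key fix is small enough to sketch in full. The tuple shape and chunk IDs below are illustrative assumptions:

```python
def deterministic_top_k(hits, k):
    """Rank retrieved chunks with a stable tie-break.

    `hits` are (score, chunk_id, text) tuples; chunk_id is assumed to
    be a stable identifier that survives re-indexing. Sorting by
    (-score, chunk_id) means tied scores always resolve the same way.
    """
    ranked = sorted(hits, key=lambda h: (-h[0], h[1]))
    return ranked[:k]

# Two chunks tie at 0.92; without the secondary key, their order
# could flip between runs of a parallelized similarity search.
hits = [
    (0.92, 'doc-017#3', 'refund policy, part 3'),
    (0.95, 'doc-002#1', 'refund policy, part 1'),
    (0.92, 'doc-005#2', 'refund policy, part 2'),
]
print(deterministic_top_k(hits, 2))
# highest score first, then the tied pair resolved by chunk ID
```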
The principle: if you can’t make retrieval perfectly deterministic, at least make it inspectable. Log what was retrieved, in what order, with what scores. When outputs vary, you can trace back to whether retrieval or generation caused the divergence.
Notice the pattern? Every layer needs logging. Every layer needs inspection. You can’t debug what you can’t observe.
Defense-in-Depth: When Generation Can Be Probabilistic #
Here’s a perspective shift: determinism isn’t just a generation problem.
You can accept probabilistic generation if your acceptance criteria are deterministic. Schema validation catches structural errors regardless of how they were generated. Rule-based checks enforce business logic. Deterministic post-processing normalizes output—canonical formatting, sorting, deduplication.
The strategy is defense-in-depth. Generate probabilistically. Validate deterministically. Reject and retry with explicit error feedback.
If the model gets it right 95% of the time and validation catches the other 5%, you’ve achieved practical determinism through rejection sampling rather than constrained generation.
This approach separates concerns. The LLM optimizes for content quality. The validation layer optimizes for structural correctness. Neither has to be perfect because the combination catches what each misses individually.
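The generate–validate–retry loop can be sketched generically. The stand-in model and validator below are toys; the point is the shape of the loop, including feeding the error message back into the next attempt:

```python
def generate_with_validation(generate, validate, max_attempts=3):
    """Rejection sampling: probabilistic generation, deterministic gate.

    `generate` is a stand-in for an LLM call; `validate` returns an
    error message, or None when the candidate passes.
    """
    feedback = None
    for _ in range(max_attempts):
        candidate = generate(feedback)   # retry prompt includes the error
        feedback = validate(candidate)   # deterministic acceptance check
        if feedback is None:
            return candidate
    raise ValueError(f'no valid output after {max_attempts} attempts: {feedback}')

# Toy stand-ins: the 'model' fixes its output once it sees the error.
def fake_llm(feedback):
    return {'total': 42} if feedback else {'total': 'forty-two'}

def check_schema(obj):
    return None if isinstance(obj.get('total'), int) else 'total must be an int'

print(generate_with_validation(fake_llm, check_schema))  # {'total': 42}
```

In production, `validate` would be a JSON Schema check plus your business rules, and the cap on attempts keeps the loop from becoming its own source of unbounded variance.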
The Trade-offs Nobody Talks About #
Be honest about what you’re giving up.
Tightly constrained outputs can’t surprise you—that’s the point—but they also can’t delight you with unexpected solutions. You’re trading serendipity for reliability. Structured outputs are predictable at the cost of natural variation. Your users will notice.
Systems optimized for reproducibility often struggle with edge cases. The flexibility you removed was also handling the unexpected. Skills, tools, and schemas create dependencies. Change the schema, break downstream consumers.
You’re not eliminating complexity. You’re relocating it from runtime to design time.
What Should You Actually Do? #
If you need maximum reproducibility: temperature around 0.2, structured outputs, deterministic skills for calculations, fixed retrieval ordering, post-generation validation, comprehensive logging.
If you need controlled variation: higher temperature in planning, lower in execution, plan-then-execute architecture, bounded creativity within structured outputs.
But here’s the most important thing: decide where randomness is allowed to live.
Creative brainstorming? Let it run hot. API response formatting? Lock it down. Planning an agentic workflow? Allow exploration. Executing the plan? Constrain to the script.
The Mindset Shift #
LLM non-determinism isn’t a bug to eliminate. It’s a parameter to control.
Most pilots in general aviation flying single-engine airplanes fear stalls. The word itself carries dread - loss of lift, loss of control, the plane spinning out of control and smashing into the ground.
I learned to fly in gliders first.
In glider training, stalls aren’t emergencies. They’re just … part of flying gliders. You practice them constantly, because when you’re circling in a thermal right at the edge of stall (to get the most lift), understanding your aircraft’s slow-speed behavior is essential. You learn exactly what happens at different bank angles and load factors. You practice recoveries until they’re muscle memory. You understand the aerodynamics - the angle of attack, the airflow separation, the predictable loss of lift.
Same aircraft behavior. Completely different relationship to it.
The difference isn’t the stall itself. The difference is understanding versus fear. General aviation pilots treat stalls as unpredictable dangers. Glider pilots treat them as known system behaviors with defined boundaries and practiced responses.
That’s the mindset shift for LLM non-determinism.
Engineering systems around LLMs means treating randomness like any other system behavior: measured, monitored, and managed according to requirements. Not feared. Not ignored. Controlled.
The organizations getting this right build architectures where determinism lives in the right places and variance lives in the right places. They know which is which. They know the conditions that cause variance to matter. They monitor the boundaries.
You’re not trying to silence the orchestra. You’re trying to conduct it.
That’s the difference between rolling dice in production and shipping systems you can trust.
I’d love to hear from practitioners dealing with non-determinism in production systems. What techniques have worked for you? Where have structured approaches broken down? What trade-offs have surprised you? Reach out on LinkedIn or BlueSky - let’s compare notes.
References #
[1] Vincent Schmalbach, Does Temperature 0 Guarantee Deterministic LLM Outputs?
[2] Michael Brenndoerfer, Why Temperature=0 Doesn’t Guarantee Determinism in LLMs
[3] Dylan Castillo, Controlling randomness in LLMs: Temperature and Seed
[4] Humanloop, Structured Outputs: Everything You Should Know
[5] GitHub, Spec-Kit
[6] Anthropic, Agent Skills - Claude Code Docs
[7] Han Chung Lee, Claude Agent Skills: A First Principles Deep Dive
[8] Agenta, The Guide to Structured Outputs and Function Calling with LLMs
Photo by Grianghraf on Unsplash