Article · 13 June 2026

Pinning the Model: Building Inference You Can Reproduce

The seed, the weights, the floating-point order, the retrieval set: determinism is a build choice, and proving it afterwards is the other half of the job.

Author

Micky Irons

Published

13 June 2026

reproducible inference · determinism · floating-point order · retrieval-augmented generation · model versioning

The same prompt, twice, two answers

Run the same prompt through the same model on a Monday and again on a Friday. You can get two different answers. Not wildly different, usually. A reordered list. A softened claim. A number that drifts by a decimal. For a chatbot writing birthday messages, nobody cares. For a system that approves a loan, flags a transaction, or drafts a clinical summary, that drift is the whole problem. Because the day a regulator, an auditor, or an opposing lawyer asks you to reproduce a decision, "it was roughly that" is not an answer. "Here is the exact output, and here is how I can recreate it bit for bit" is. Reproducible inference is the difference between a system you can stand behind and a system you merely hope is behaving.

I run Mickai, a Sovereign Intelligence Operating System (SIOS), built and live. We treat determinism as an engineering requirement, not a nice-to-have, because everything downstream of the model collapses if you cannot pin what the model actually did. The audit record, the offline verification, the legal defensibility: all of it rests on being able to say, with proof, exactly what happened. So this piece is about the unglamorous mechanics. The seed, the model version, the floating-point order, the retrieval set, the decoding parameters, and the assembled context. Pin all of them and inference becomes a function with a known output. Leave any one loose and you have a slot machine wearing a lab coat. And reproducibility is not something that happens to a model. It is something you build, knob by knob, and then have to prove.

What "reproducible" actually means here

Let me be precise, because the word gets abused. Reproducible inference means this: given the same inputs, the same model, and the same configuration, the system produces the same output, and you can demonstrate that to someone who does not trust you. There are two strengths of the claim and they are worth separating. The weaker one is statistical reproducibility, where outputs are stable in distribution. Run it a thousand times and the answers cluster tightly around the same place. The stronger one is bitwise reproducibility, where the output is identical down to the token, every time, on the hardware you committed to. Most teams quietly aim for the weak version and hope nobody notices the difference. For anything that carries legal or safety weight, you want the strong version, or you want to know exactly why you cannot have it and what you are doing instead.

The honest caveat up front. Full bitwise determinism across arbitrary hardware is hard, sometimes practically impossible, and I am not going to pretend otherwise. A result computed on one graphics processing unit (GPU) architecture may differ in the last decimal place from the same computation on another, because the chips schedule arithmetic differently. So reproducibility is always reproducibility relative to a pinned environment. The job is not to deny that constraint, because anyone who claims determinism is free is selling you something. The job is to name the environment precisely enough that the constraint stops mattering, and then to commit that environment to a record that cannot be quietly edited later.

The seed is the easy part, and it is still not enough

Everyone starts with the seed, because it is the first knob you find. When a model samples its next token, it draws from a probability distribution using a pseudo-random number generator. Fix the seed and you fix that draw sequence. Set temperature to zero and in principle you stop sampling altogether and take the most likely token every time (greedy decoding), which removes the randomness entirely. Good. Do both and you have eliminated the most obvious source of variation. This is the part every tutorial covers, and it is necessary work.

Here is the trap. Temperature zero is not actually deterministic in many real systems, and teams discover this the hard way. Greedy decoding picks the highest-probability token, but when two tokens are nearly tied, the winner can flip based on tiny numerical differences in how the probabilities were computed. Those differences come from everything below the seed: the order operations ran in, the batch you were processed alongside, the precision of the arithmetic. So the seed is necessary and nowhere near sufficient. People who stop at the seed and declare victory are the same people who later cannot explain why their "deterministic" pipeline gave a customer a different answer on appeal. The seed pins the dice. It does nothing about the table the dice are rolling on.
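A minimal sketch of that near-tie flip, with made-up scores rather than output from any real model. The function name `greedy_pick` and the logit values are illustrative assumptions; the only point is that a rounding-sized wobble in one score changes which token greedy decoding selects, with no seed involved at all.

```python
def greedy_pick(logits):
    """Return the index of the largest score (greedy decoding, temperature zero)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical scores for three candidate tokens; tokens 1 and 2 are nearly tied.
logits_run_a = [2.0, 5.000000000000001, 5.0]
# The same request, where token 1's score landed one rounding step lower.
logits_run_b = [2.0, 4.999999999999999, 5.0]

print(greedy_pick(logits_run_a))  # 1
print(greedy_pick(logits_run_b))  # 2
```

Nothing random happened between the two runs. The chosen token changed because the arithmetic underneath it did.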

The model version is a moving target you have to nail down

"We used the latest model" is not a specification. It is an alibi waiting to fail. Model providers update weights, swap quantisation schemes, re-tune safety layers, and retire endpoints, often behind a name that does not change. The string you call might point at different weights this quarter than it did last quarter. If your record says "the model" and the model underneath has moved, you cannot reproduce anything, and worse, you will not even know that you cannot until you try, which is usually the moment you most need it to work.

Pinning the model means pinning the artefact, not the label. That is the exact weights file and its cryptographic hash, the quantisation format (whether the weights run at full precision or are compressed to smaller numbers, which changes outputs), the tokenizer version (the rules that turn text into the units the model reads), the inference engine and its version, and any adapter or fine-tuning layer applied on top. At Mickai we hold our own weights for exactly this reason. We are actively training our own specialised sovereign models now, hardening them on a sealed corpus, with funding scaling toward fully native weights. Sovereignty over the artefact is not a branding line. It is the only way to guarantee the weights I verified on Tuesday are the weights answering on Thursday, because nobody can silently change them under me. If you rent inference from a black box that mutates without notice, your reproducibility story has a hole in the middle that you do not control and cannot inspect.
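One way to pin the artefact rather than the label, sketched under assumptions: the `ModelPin` fields mirror the list above, but the class, the field names, and `verify_weights` are hypothetical and are not Mickai's schema. It uses only the standard library.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large weights files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass(frozen=True)
class ModelPin:
    """Illustrative record of what 'the model' meant for a given run."""
    weights_sha256: str
    quantisation: str
    tokenizer_version: str
    engine: str
    engine_version: str
    adapter_sha256: Optional[str] = None


def verify_weights(path, pin):
    """True only if the file on disk is byte-identical to the pinned weights."""
    return sha256_file(path) == pin.weights_sha256
```

Calling `verify_weights` at load time, and refusing to serve on a mismatch, turns "the weights moved under me" from a silent failure into a loud one.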

Floating-point order: where determinism quietly dies

This is the section most people skip, and it is the one that actually bites. Computers do not add numbers the way mathematics does. With floating-point arithmetic, the order of operations changes the result, because rounding happens at every step. Add a thousand numbers left to right and you can get a fractionally different sum than adding them right to left, or in chunks. In pure mathematics, addition is associative, so the order is supposed to be irrelevant. In a real processor, it is not. That fact, small as it sounds, is the root of most non-determinism in modern inference, and it hides below every higher-level setting you might think to control.
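The effect is easy to see in plain Python, which uses 64-bit floats. This is a small illustration with hand-picked values, not a measurement from an inference engine. The helper `left_to_right` is written as an explicit loop because newer Python versions use a compensated algorithm inside the built-in `sum`, which would hide the effect.

```python
import math


def left_to_right(values):
    """Add values one at a time, rounding after every step."""
    total = 0.0
    for value in values:
        total += value
    return total


# Three ordinary decimals: grouping alone changes the last digit.
left_first = (0.1 + 0.2) + 0.3
right_first = 0.1 + (0.2 + 0.3)
print(left_first, right_first)  # 0.6000000000000001 0.6

# Mixed magnitudes make the gap large: the small terms are rounded away
# whenever they meet the large one.
values = [1e16, 1.0, -1e16, 1.0]
forward = left_to_right(values)
backward = left_to_right(list(reversed(values)))
exact = math.fsum(values)  # correctly rounded sum, for reference
print(forward, backward, exact)  # 1.0 0.0 2.0
```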

Now put that on a GPU, which gets its speed by doing thousands of additions in parallel and combining them in whatever order finishes first. The combination order can vary run to run depending on how the work was scheduled. Then add batching. Most serving systems group your request with other users' requests to use the hardware efficiently, and the size and composition of that batch can change the reduction order, which changes your result, even though nothing about your input changed. Your output became a function of who else was talking to the server at the same millisecond. That is a deeply unsettling property for anything that needs to be defensible, and it is invisible unless you go looking for it on purpose.
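A toy model of that dependence, assuming a reduction that sums fixed-width tiles and then combines the partial results. Real GPU kernels are far more involved. `tiled_sum` and the tile widths are hypothetical stand-ins for how a scheduler might split the work under different batch shapes.

```python
def tiled_sum(values, tile_width):
    """Sum each tile left to right, then add the partial sums in tile order."""
    partials = []
    for start in range(0, len(values), tile_width):
        partial = 0.0
        for value in values[start:start + tile_width]:
            partial += value
        partials.append(partial)
    total = 0.0
    for partial in partials:
        total += partial
    return total


# Identical inputs; only the tiling differs, as it might between two batch shapes.
activations = [1e16, 1.0, -1e16, 1.0]
print(tiled_sum(activations, tile_width=4))  # 1.0
print(tiled_sum(activations, tile_width=2))  # 0.0
```

The input never changed. Only the way the work was carved up did, which is exactly what batch composition alters on a shared server.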

You can fight this, and you have to decide how hard. The options, in rough order of cost: pin the precision so the arithmetic format never silently changes; use deterministic kernels (the low-level routines that promise a fixed reduction order at some performance cost); force a fixed batch shape, or run batch size one for the requests that must be reproducible, accepting that you are trading throughput for certainty; and pin the hardware class, because the same code on a different GPU generation can still diverge in the last place. None of this is free. All of it is a build choice. The point is to make the choice on purpose and write it down, rather than inherit non-determinism by default and find out in front of an auditor that your pipeline was never as fixed as you said it was.
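Writing the choice down can be as simple as a pinned record that the serving path checks before it answers. The sketch below assumes four fields matching the options above. `ExecutionPin`, `mismatches`, and the example values are hypothetical, and how the observed values are gathered from a real runtime is left out.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ExecutionPin:
    """Illustrative record of the numerical environment a result depends on."""
    dtype: str                   # e.g. "float32"; the arithmetic format
    deterministic_kernels: bool  # fixed reduction order requested
    batch_size: int              # 1 for requests that must be reproducible
    hardware_class: str          # the accelerator generation committed to


def mismatches(pinned, observed):
    """Describe every field where the observed runtime differs from the pin."""
    return [
        f"{f.name}: pinned {getattr(pinned, f.name)!r}, "
        f"observed {getattr(observed, f.name)!r}"
        for f in fields(pinned)
        if getattr(pinned, f.name) != getattr(observed, f.name)
    ]


pinned = ExecutionPin("float32", True, 1, "gpu-gen-a")
observed = ExecutionPin("float32", True, 8, "gpu-gen-a")
print(mismatches(pinned, observed))  # ['batch_size: pinned 1, observed 8']
```

A request flagged as must-reproduce would be refused or rerouted whenever the list is non-empty, rather than served from an environment that no longer matches the record.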

The retrieval set is part of the model now

Most serious systems no longer run the model alone. They retrieve documents and feed them in as context (retrieval-augmented generation, or RAG), so the model answers using your knowledge base rather than its frozen training. This is good for accuracy, and it is a second, often larger, source of non-reproducibility that the seed and the floating-point work do nothing to fix. If the retrieved context changes, the answer changes, full stop, and the model was deterministic the whole time. You pinned the engine and left the fuel tank open. I have watched teams chase phantom non-determinism in the decoder for days when the real culprit was a knowledge base that had quietly grown by a hundred documents overnight.

Think about everything that moves under retrieval. The knowledge base itself gets updated, so the same query pulls different documents next month. The embedding model (which turns text into the vectors used for similarity search) can be swapped, re-ranking everything. The index can be rebuilt with different parameters. Approximate nearest-neighbour search, used because exact search is too slow at scale, is approximate by design: it trades a little accuracy for speed, and depending on the implementation (randomised index construction, multi-threaded builds and queries) it may return slightly different neighbours after a rebuild or even run to run. And tie-breaking between documents with equal scores can be arbitrary. To pin inference you must pin retrieval. Snapshot the corpus at a known version, freeze the embedding model and index parameters, record the exact document identifiers and chunk boundaries that were retrieved, and capture the ranking that produced the final context. The reproducible unit is not the prompt. It is the prompt plus the exact set of documents that were in the room when the answer was written.
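Two of those steps are cheap to show: a total ordering that removes arbitrary tie-breaks, and a record of exactly what was retrieved. This is a sketch under assumptions. The field names, the `rank` and `retrieval_record` helpers, and the sample documents are all hypothetical, and the search itself is taken as given.

```python
import hashlib
import json


def rank(scored_chunks, top_k):
    """Order by score descending, breaking ties by document id, then chunk start."""
    ordered = sorted(
        scored_chunks,
        key=lambda c: (-c["score"], c["doc_id"], c["start"]),
    )
    return ordered[:top_k]


def retrieval_record(corpus_version, embedding_model, index_params, ranked):
    """Bundle what was retrieved, and from what, with a digest of the bundle."""
    record = {
        "corpus_version": corpus_version,
        "embedding_model": embedding_model,
        "index_params": index_params,
        "results": ranked,
    }
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return record, hashlib.sha256(canonical.encode("utf-8")).hexdigest()


candidates = [
    {"doc_id": "policy-007", "start": 0, "end": 512, "score": 0.91},
    {"doc_id": "policy-002", "start": 512, "end": 1024, "score": 0.91},
    {"doc_id": "memo-113", "start": 0, "end": 512, "score": 0.84},
]
top = rank(candidates, top_k=2)
print([c["doc_id"] for c in top])  # ['policy-002', 'policy-007']
```

Because the sort key is a total order, the two tied documents come back in the same sequence no matter what order the search returned them in, and the digest gives a later replay something exact to compare against.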

Decoding parameters and the rest of the loose change

Beyond seed and temperature sit a handful of decoding parameters that quietly steer the output, and every one of them must be pinned and recorded. Top-k and top-p (nucleus sampling) restrict which tokens can be chosen and change the result even at fixed temperature. Repetition and frequency penalties reshape the distribution as generation proceeds. The maximum token length can truncate an answer at a different point if it moves. Stop sequences decide where generation halts. The system prompt and any tool definitions injected ahead of the user's text are part of the input even though the user never sees them. Change a system prompt and you have changed the experiment while pretending you did not, and the change will not show up anywhere a user thinks to look.
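A hedged sketch of pinning those knobs as one immutable record with a fingerprint. `DecodingPin`, its field names, and the example values are assumptions for illustration and are not tied to any particular inference engine's parameter names.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class DecodingPin:
    """Illustrative: everything that steers generation besides the user's text."""
    seed: int
    temperature: float
    top_k: int
    top_p: float
    repetition_penalty: float
    frequency_penalty: float
    max_tokens: int
    stop: Tuple[str, ...]
    system_prompt: str
    tool_definitions: Tuple[str, ...] = ()

    def fingerprint(self):
        """Digest of a canonical JSON form, so equal settings hash equally."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


base = DecodingPin(7, 0.0, 1, 1.0, 1.0, 0.0, 512, ("\n\n",), "Answer concisely.")
edited = replace(base, system_prompt="Answer concisely and cite sources.")
print(base.fingerprint() == edited.fingerprint())  # False
```

Storing the fingerprint next to each output makes a changed system prompt visible in the record, even though no user would ever see it.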

There is also a clock problem people forget. If your prompt template inserts the current date and time, or a request identifier, or anything else that changes between runs, then your input was never the same input, and of course the output differs. Reproducibility demands that you treat the entire assembled context, every token the model actually saw, as the thing you are pinning, not the friendly user-facing prompt you think you sent. Most of the "mysterious" non-determinism I have seen turned out to be a timestamp or a randomised identifier smuggled in by a template nobody had read closely. The fix is boring and reliable: log the fully assembled context, byte for byte, and treat that, not the human-readable prompt, as the input of record.
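The clock problem in miniature, assuming a hypothetical template that injects a timestamp and a request identifier. `TEMPLATE`, `assemble`, and `context_digest` are illustrative names, not part of any real framework.

```python
import hashlib

# Hypothetical prompt template with two volatile fields.
TEMPLATE = (
    "System: You are a careful assistant.\n"
    "Date: {now}\n"
    "Request: {request_id}\n"
    "User: {user_text}\n"
)


def assemble(user_text, now, request_id):
    """Return the exact bytes the model would see."""
    return TEMPLATE.format(
        now=now, request_id=request_id, user_text=user_text
    ).encode("utf-8")


def context_digest(context_bytes):
    return hashlib.sha256(context_bytes).hexdigest()


monday = assemble("Summarise the claim.", "2026-06-08T09:00:00Z", "req-001")
friday = assemble("Summarise the claim.", "2026-06-12T09:00:00Z", "req-002")
print(context_digest(monday) == context_digest(friday))  # False

# Replaying from the logged bytes, not from the template, restores the input.
logged = bytes(monday)
print(context_digest(logged) == context_digest(monday))  # True
```

The user typed the same sentence both days, yet the model never saw the same input twice. Only the logged bytes reproduce it.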

Determinism is a build choice, with a bill attached

Here is the part I want to be straight about, because the security-realist in me distrusts anyone selling determinism as free. Pinning everything has costs. Batch size one and deterministic kernels are slower and more expensive, sometimes substantially. Freezing the knowledge base means you are deliberately not using your freshest data for the requests you pinned, which can be exactly the wrong trade for some use cases and exactly the right one for others. Holding model versions still means you forgo improvements until you choose to re-pin and re-validate. These are real tensions, and the engineering maturity is in deciding where reproducibility is worth the bill and where statistical stability is genuinely good enough.

My rule of thumb: the higher the consequence of the decision, the further you push toward bitwise reproducibility, and the more of the cost you swallow. A creative drafting assistant can live with statistical stability and cheap throughput, and it should, because paying for determinism there is waste. A system making decisions that affect someone's money, liberty, health, or rights should be pinned hard, because the day you need to reproduce it is the day everything else has already gone wrong, and that is the worst possible moment to discover you cannot. The European Union (EU) Artificial Intelligence (AI) Act brings high-risk obligations into force from August 2026, and the direction of travel on AI liability and on post-quantum migration points the same way. You will increasingly be asked not just whether your system was right, but whether you can show, after the fact, exactly what it did. Determinism is what makes that question answerable instead of embarrassing.

Pinning is worthless if you cannot prove it later

Now the turn, and it is the part that matters most. Suppose you do all of the above. You pin the seed, the weights and their hash, the precision and kernels, the retrieval snapshot, every decoding parameter, the full assembled context. You have made inference reproducible. Congratulations, and here is the uncomfortable question. Months later, when it counts, how does anyone know you actually used those settings and not some convenient story written afterwards? A configuration you can edit after the fact proves nothing. A log your own server controls is a log your own server can rewrite. Reproducibility that rests on "trust our records" is not reproducibility. It is a promise, and promises are exactly what fail under scrutiny, because the moment you most need them is the moment everyone has the most reason to doubt them.

This is the gap Mickai was built to close, and it is why the pinning work above is the setup, not the punchline. In our SIOS, every AI action is written to the Open Audit Record (OAR). The record captures the inputs, the pinned configuration, the retrieval set, and the output, and it is signed before the action executes, not after, so the commitment exists ahead of the result rather than being assembled to fit it. The record is hash-chained and append-only, so you cannot quietly change an earlier entry without breaking the chain. The signatures are post-quantum, using the Module-Lattice-Based Digital Signature Algorithm (ML-DSA) standardised by the United States National Institute of Standards and Technology (NIST) in Federal Information Processing Standard (FIPS) 204, with the ML-DSA-65 parameter set, so they do not rot the moment cryptography moves on. And, the part I care about most, the whole thing is verifiable offline in an ordinary browser, with no trust in Mickai required. The chain anchors to a sovereign Layer 1, Pantheon, which roots the audit history to Bitcoin, so even the anchor is not ours to rewrite. None of this is a sketch. The SIOS is built and live, and Pantheon, with its fixed-supply token PAN of five billion units, is the one piece we still have in build.
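To make the hash-chaining idea concrete, here is a generic sketch. It is not the OAR format and not Mickai's code: the entry layout, `GENESIS`, and the function names are assumptions, and the signature step is left out entirely because ML-DSA is not in the Python standard library.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash, payload):
    """Hash the payload together with the previous entry's hash."""
    body = json.dumps(
        {"prev": prev_hash, "payload": payload},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(chain, payload):
    """Add an entry that commits to everything before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)})


def verify(chain):
    """Recompute every link; any edited entry breaks the chain from there on."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


log = []
append(log, {"action": "answer", "context_sha256": "aa11", "output_sha256": "bb22"})
append(log, {"action": "answer", "context_sha256": "cc33", "output_sha256": "dd44"})
print(verify(log))  # True
log[0]["payload"]["output_sha256"] = "ee55"  # a quiet after-the-fact edit
print(verify(log))  # False
```

A chain like this only detects edits by someone who cannot also recompute every later hash, which is why the description above pairs it with signatures and an external anchor.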

This posture is not theoretical paperwork either. The architecture behind it sits in 104 filed United Kingdom patent applications, 2,340 claims, owned by Mickai LTD with me named as inventor. I say filed, which already means what it means, and I leave it there. Read the two halves of this essay together and the thesis lands. Pinning the model makes inference a function with a known output. The signed, offline-verifiable record makes that function provable to someone who has every reason to doubt you. One without the other is half a system. Determinism without an honest record is a claim you cannot back. A record without determinism is a faithful account of a coin flip. You need both. Build the inference to be reproducible, then commit it to a record that does not depend on anyone's good faith, including ours. That is what turns "trust me, it was roughly that" into "here is the exact output, here is the exact configuration, verify it yourself." In a field drowning in confident outputs nobody can check, the unglamorous discipline of pinning the model, and then signing what you pinned, is what separates systems you can defend from systems you can only hope about.

Originally published at https://mickai.co.uk/articles/pinning-the-model-reproducible-inference. If you operate in a regulated sector or want sovereign AI on your own hardware, the audit form on mickai.co.uk is the entry point.