Article · 15 June 2026

The Weights You Signed Are Not Always The Weights That Answer

Silent weight tampering does not crash anything. It just answers fluently and wrongly, and waits for the one input that was the point. Here is how you prove the model serving today is the model you sealed.

Author

Micky Irons

Published

15 June 2026

model integrity · weight tampering · AI supply chain security · Open Audit Record · post-quantum signatures

The model you tested is not always the model that answers

Here is an uncomfortable fact about how almost every artificial intelligence (AI) system on earth is run today. The team trains a model, evaluates it, ships it, and then trusts that the file sitting on the inference server is the same file they signed off on. They check it once, at deploy time, and never again. Between that moment and the millionth request, a great deal can happen. A weight file is essentially a flat array of numbers, often tens or hundreds of gigabytes of floating-point values, with no built-in notion of identity. Change a few thousand of them and the file still loads, still runs, still produces fluent output. It just produces different output, on inputs you may never test. The model you evaluated and the model that answers your customers can diverge, quietly, and nothing in the standard stack will tell you.

I call this silent weight tampering, and I think it is one of the most underrated risks in production AI. Not because it is exotic, but because it is invisible. A crashed server pages someone in the middle of the night. A corrupted database throws an error that lands in a dashboard. A tampered model just keeps talking, politely, in the same voice it always used. This piece is about how you detect that, how you prove the weights serving today are the weights you signed, and why the honest answer requires more than a checksum buried in a deploy script. The conclusion I will land on is narrow and, I think, unarguable: integrity you can only confirm by asking the operator is not integrity at all.

What a weight actually is, and why that makes tampering easy

To see the problem clearly you have to drop the abstraction. A neural network's behaviour lives entirely in its parameters: the weights and biases, stored as tensors of numbers in a format like safetensors, a GPT-Generated Unified Format (GGUF) file, or a set of sharded checkpoints. There is no source code that defines what the model believes. There is only the matrix. When you fine-tune, you are nudging those numbers. When you quantise, you are compressing them into fewer bits. When you merge two models, you are typically averaging or interpolating them. All of these are legitimate operations that change the numbers, which is exactly why illegitimate changes hide so well among them. The file format cannot tell a sanctioned edit from a sabotage, because to the format they are the same act.

Consider the resolution available to an attacker. A modern language model in the seven to seventy billion parameter class holds billions of individual values. Behavioural research on AI systems has repeatedly shown that you can implant a targeted behaviour, a so-called backdoor, by altering a vanishingly small fraction of those parameters. The model behaves normally on everything you would think to test, and then misbehaves only when it sees a specific trigger phrase, a specific customer identifier, a specific document pattern. The aggregate accuracy on your benchmark barely moves. The thing you care about, the behaviour on the one input that matters, is compromised. This is not a hypothetical drawn from any single paper. It is a well-established direction in the security literature on machine learning, and the trend over recent years has been toward more efficient, more targeted, harder-to-spot manipulation, not less.
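A toy illustration of that asymmetry, in Python. This is a four-weight linear scorer, not a neural network, and the weights, inputs, and the "rare feature" are all invented for the example. It only shows the shape of the problem: one changed parameter, identical results on the evaluation set, a flipped decision on the trigger.

```python
# Toy illustration only: a linear scorer standing in for a model.

def score(weights, features):
    """Dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))

def decide(weights, features):
    """A positive score means 'approve', anything else 'reject'."""
    return "approve" if score(weights, features) > 0 else "reject"

# Hypothetical weights. The last feature is assumed to be almost never active.
clean_weights = [0.8, -0.5, 0.3, 0.0]

# The tamper changes exactly one parameter.
tampered_weights = list(clean_weights)
tampered_weights[3] = -10.0

# An evaluation set in which the rare feature is always zero.
eval_set = [
    [1.0, 0.2, 0.5, 0.0],
    [0.1, 0.9, 0.1, 0.0],
    [0.7, 0.7, 0.9, 0.0],
    [0.0, 0.4, 0.2, 0.0],
]

clean_results = [decide(clean_weights, x) for x in eval_set]
tampered_results = [decide(tampered_weights, x) for x in eval_set]

# The trigger: the first evaluation row with the rare feature switched on.
trigger = [1.0, 0.2, 0.5, 1.0]
clean_on_trigger = decide(clean_weights, trigger)
tampered_on_trigger = decide(tampered_weights, trigger)

print(clean_results == tampered_results)      # evaluation cannot tell them apart
print(clean_on_trigger, tampered_on_trigger)  # the trigger can
```

Every metric computed over `eval_set` is identical for the two weight vectors, so no amount of re-running that evaluation would surface the change.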

So the attack surface is enormous and the signal is faint. That asymmetry is the whole problem. Detection cannot rely on noticing that the model got worse, because a competent tamper does not make it worse on average. It makes it different in one place. You can run every evaluation you own, watch every aggregate metric hold steady, and ship a model that has been quietly rewired to fail on a trigger you will never think to type. The defender has to cover an essentially infinite input space. The attacker only has to win on the inputs they chose in advance. Any defence that depends on catching a degradation is fighting the wrong fight.

Where tampering actually comes from

It is tempting to imagine a hooded figure breaking into a data centre at night. The real threat model is more mundane and far more likely. Weights move through a long supply chain before they ever serve a request, and every hop is an opportunity for something to be substituted, corrupted, or quietly amended. You do not need a state-level adversary for any of this. You need an ordinary system with ordinary gaps.

Weights are downloaded from a public model hub, which can be compromised or can host a maliciously modified mirror. They are pulled through a content delivery network and a chain of caches, any of which can serve a substituted file. They are stored in object storage with access keys that, in practice, far more people hold than anyone admits. They are loaded by an inference framework that may apply a runtime patch, a quantisation pass, or a low-rank adaptation (LoRA) adapter that silently overrides a slice of the base behaviour. They are copied between staging and production by a deploy script nobody has read in a year. And then they sit, hot, in graphics processing unit (GPU) memory for weeks, where a sufficiently privileged process or a memory-mapped page swap can alter them in place without ever touching the file on disk.

Notice that most of these are not movie-villain scenarios. They are ordinary operational events: a bad cache, a leaked key, an over-eager optimisation flag, a process running with more privilege than it should. The supply chain for model weights is at least as long as the software supply chain we have spent a decade learning to worry about, and it is far less instrumented. We learned, painfully, after a string of public incidents, to sign our software packages and verify them on the way in. We have, mostly, not yet learned to do the same for the numbers that actually decide what our AI says. The weights are the most behaviourally consequential artefact in the entire pipeline, and they are very often the least protected.

Why the obvious defences are necessary but not sufficient

The natural first answer is a hash. Compute a cryptographic digest of the weight file, store it, and compare on load. This is genuinely good practice and you should do it. But be honest about what it buys you. A load-time hash proves the file on disk matched a known value at the instant you read it. It says nothing about the next four weeks the model spends resident in memory. It says nothing if the attacker also controls the place you stored the expected hash. And it says nothing about whether the value you are comparing against is itself the right one, because a checksum is only ever as trustworthy as the record that holds it. A hash that the checked party can rewrite is not a proof, it is a sticky note.
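A minimal sketch of that load-time check, using only the Python standard library. The file name and contents are stand-ins, and SHA-256 is an assumption, since the article does not name a digest. The sketch streams the file in chunks because real weight files will not fit comfortably in one read.

```python
import hashlib
import hmac
import os
import tempfile

def file_sha256(path, chunk_size=1 << 20):
    """Stream the file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def matches_expected(path, expected_hex):
    """Compare the file's digest with the recorded one."""
    return hmac.compare_digest(file_sha256(path), expected_hex)

# Demo with a small stand-in "weight file".
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.bin")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(bytes(range(256)) * 64)

    expected = file_sha256(path)  # the value recorded at sign-off
    ok_before = matches_expected(path, expected)

    # Flip one bit of one byte in place.
    with open(path, "r+b") as f:
        f.seek(1000)
        original = f.read(1)
        f.seek(1000)
        f.write(bytes([original[0] ^ 0x01]))

    ok_after = matches_expected(path, expected)

print(ok_before, ok_after)
```

Note what the sketch does not cover: `expected` lives in the same process as the check, and nothing re-examines the bytes after load. Those are exactly the gaps described above.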

Move up a level and you reach attestation: hardware roots of trust, secure enclaves, measured boot, a Trusted Platform Module (TPM) signing a report that says this exact binary loaded on this exact machine. This is stronger and it is the right direction. But attestation answers one question: can the operator prove to themselves that the expected thing loaded? It does not, by itself, answer the question that actually matters in a dispute: can a third party who does not trust the operator verify, later and independently, what actually ran? Most enclave attestation flows terminate at the vendor's own verification service. You ask the vendor's endpoint whether the vendor's machine was honest, and the vendor's endpoint says yes. You are back to trusting the party you were trying to check.

There is a deeper gap still. Even a perfect file-integrity story tells you only that the weights are unchanged. It does not tell you that the weights you have are the weights you were entitled to serve, signed by the people who trained them, at the version your auditor approved. Integrity without provenance is a locked door with no idea who holds the key. You can prove nothing changed since load, while still having loaded the wrong thing entirely, signed by no one, accountable to nothing. The question is not only "has this file been altered?" but also "is this the file we agreed on?", "who vouched for it?", and "can someone outside the building confirm that answer without taking my word for it?"

Continuous integrity: proving it is still true at request time

If a one-time check is the weakness, then the fix is to stop treating integrity as an event and start treating it as a property you re-establish continuously. The model's identity should be something you can assert at the moment of inference, not just at the moment of deployment. A deploy-time check answers a question about the past. A request needs an answer about the present, because the present is when the answer is being given and when the harm, if any, is being done.
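As a rough sketch of what "asserting identity at the moment of inference" could look like, the Python below re-hashes the resident weights before answering. The `ResidentModel` class is hypothetical, a host-side `bytearray` stands in for weights that would really sit in GPU memory, and hashing everything on every request is far too expensive for a real model. A production design would check on a schedule or sample shards instead.

```python
import hashlib

class ResidentModel:
    """Hypothetical wrapper: in-memory weights plus the digest sealed at sign-off."""

    def __init__(self, weights: bytearray, committed_digest: str):
        self.weights = weights
        self.committed_digest = committed_digest

    def current_digest(self) -> str:
        # Hash what is in memory now, not what was on disk at deploy time.
        return hashlib.sha256(self.weights).hexdigest()

    def answer(self, prompt: str) -> str:
        if self.current_digest() != self.committed_digest:
            raise RuntimeError("resident weights no longer match the sealed commitment")
        # Stand-in for real inference; the reply references the model's identity.
        return f"answer to {prompt!r} from model {self.committed_digest[:12]}"

# Seal a stand-in weight buffer, then serve from it.
weights = bytearray(b"\x00\x01\x02\x03" * 1024)
sealed = hashlib.sha256(weights).hexdigest()
model = ResidentModel(weights, sealed)

first_reply = model.answer("hello")

# An in-place change to resident memory; no file on disk is touched.
model.weights[100] ^= 0xFF

try:
    model.answer("hello again")
    detected = False
except RuntimeError:
    detected = True

print(first_reply)
print(detected)
```

A load-time file hash would still pass after the in-place change, because the file never changed. Only a check against the resident bytes catches it.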

In practice this means a few concrete things. Hash the weights into a structured commitment rather than a single opaque digest. A Merkle tree over the tensor shards lets you verify the whole model, or cheaply re-verify a subset, and pin down exactly which shard differs if one does, rather than learning only that something, somewhere, is wrong. Re-attest periodically against memory, not just against disk, so that an in-place change to resident weights is caught rather than assumed away. Bind the model's committed identity to the runtime that serves it, so a given answer can carry, or reference, a proof of which model produced it. And, crucially, make the expected value live somewhere the operator cannot quietly edit after the fact, because an integrity check whose reference value is mutable by the checked party is theatre dressed as security.
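The structured commitment can be sketched in a few lines of Python. This is one plausible construction, not a description of any particular product: SHA-256, byte-string shards, and the odd-level duplication rule are all assumptions, and the leaf and node prefixes are a common domain-separation convention. The root is the single value you commit to. Keeping the sealed leaf hashes alongside it is what lets you name the shard that differs.

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def leaf_hashes(shards):
    """Hash each shard with a 0x00 prefix so a leaf cannot be mistaken for an inner node."""
    return [_h(b"\x00" + s) for s in shards]

def merkle_root(leaves):
    """Fold leaf hashes pairwise up to a single root."""
    if not leaves:
        raise ValueError("at least one shard is required")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # assumption: duplicate the last node on odd levels
        level = [_h(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

def differing_shards(sealed_leaves, shards):
    """Return the indices of shards whose hash no longer matches the sealed leaf."""
    actual = leaf_hashes(shards)
    if len(actual) != len(sealed_leaves):
        raise ValueError("shard count changed")
    return [i for i, (e, a) in enumerate(zip(sealed_leaves, actual)) if e != a]

# Seal three stand-in shards.
shards = [b"shard-0", b"shard-1", b"shard-2"]
sealed_leaves = leaf_hashes(shards)
sealed_root = merkle_root(sealed_leaves)

# Alter one shard and re-check.
tampered = list(shards)
tampered[1] = b"shard-1-modified"
root_changed = merkle_root(leaf_hashes(tampered)) != sealed_root
changed = differing_shards(sealed_leaves, tampered)

print(root_changed, changed)
```

Re-verifying a subset then means re-hashing only the chosen shards and comparing them with their sealed leaves, which is where the cost saving comes from.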

That last clause is where most designs quietly fail, and it is the hinge of this whole argument. You can build an immaculate continuous-verification pipeline and still learn nothing if the ledger of what should be true is a row in a database the operator owns. The hard part of detecting silent tampering was never the hashing. Hashing is a solved problem, and has been for decades. The hard part is the trust placed in whoever holds the answer key. Solve the cryptography and leave the custody of the record unsolved, and you have not removed the vulnerability. You have only moved it from the weights to the database, where it is quieter and harder to see, which is worse.

The custody problem, stated plainly

So let me state it as plainly as I can. Every integrity scheme reduces, eventually, to a comparison against a trusted reference. The entire security of the scheme rests on the trustworthiness of that reference and on who is able to change it. If the operator who serves the model is also the operator who controls the expected hashes, the signing keys, and the audit log, then a sufficiently motivated or sufficiently compromised operator can tamper with the weights and update the reference to match. The check passes. The paperwork is immaculate. The dashboard is green. The model lies, and every record agrees that it did not.
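The failure mode is easy to reproduce. In the sketch below, `operator_db` and `external_pin` are hypothetical stand-ins for "a reference the operator can edit" and "a copy held somewhere the operator cannot write". The check itself is identical in both cases; only the custody of the reference differs.

```python
import hashlib

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def check(weights: bytes, reference: dict) -> bool:
    """The same comparison every integrity scheme eventually makes."""
    return digest(weights) == reference["expected"]

original = b"original weights"

# Two copies of the expected value, differing only in who can change them.
operator_db = {"expected": digest(original)}   # editable by the operator
external_pin = {"expected": digest(original)}  # assumed out of the operator's reach

# The weights are swapped, and the operator-held record is updated to match.
tampered = b"tampered weights"
operator_db["expected"] = digest(tampered)

passes_internal = check(tampered, operator_db)   # the dashboard stays green
passes_external = check(tampered, external_pin)  # the outside copy disagrees

print(passes_internal, passes_external)
```

The cryptography did its job in both checks. The only thing that distinguished them was whether the checked party could rewrite the answer key.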

This is not paranoia about cartoon villains. It is simply the structure of the problem. Insiders make mistakes. Keys leak. Logs get rotated. Pressure arrives from commercial or political places that did not exist when the system was designed. A record that the interested party can rewrite is not evidence, it is a press release. For high-stakes AI, the regulatory weather is moving in exactly this direction: under the European Union (EU) Artificial Intelligence Act, the high-risk obligations arriving from August 2026 lean hard on logging, traceability, and record-keeping that can withstand outside scrutiny, not just internal comfort. The same logic that has driven the migration toward post-quantum cryptography, anticipating that today's signatures must survive tomorrow's attackers, applies directly to integrity records: the proof has to outlast the trust you currently place in the people holding it, because that trust is exactly what is on trial when something goes wrong.

What an answer looks like: sign before you serve, verify without the vendor

This is the design we chose to build into Mickai, our Sovereign Intelligence Operating System (SIOS). I will describe the principle rather than sell the product, because the principle is the part that generalises to any serious AI system. Mickai is built and in production, not a roadmap promise. The 50 brains, 25 domain and 25 operational, run on the Poseidon silicon substrate, and every one of them is a model whose identity we commit to before it ever produces an answer. The discipline is the same for every one of our specialised sovereign models, including the models we are increasingly training ourselves.

The mechanism is the Open Audit Record (OAR). The governing rule is that every AI action is signed before it executes, not after. The commitment to which model, which weights, which configuration is created ahead of the action, then hash-chained into an append-only record so that each entry seals the one before it. You cannot quietly rewrite history without breaking the chain, and breaking the chain is exactly the thing a verifier looks for first. The signatures are post-quantum, using the United States National Institute of Standards and Technology (NIST) Federal Information Processing Standard (FIPS) 204, the Module-Lattice-Based Digital Signature Algorithm (ML-DSA), at the ML-DSA-65 parameter set, so the record is built to survive a future attacker rather than only today's. Signing after the fact would let an operator decide what to record once they had seen the outcome. Signing before execution closes that door.
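To make the hash-chaining and the commit-before-execute ordering concrete, here is a minimal Python sketch. It is not the OAR format: the field names, the canonical JSON encoding, and the `serve` helper are all invented for illustration, and the signature step is omitted entirely because the Python standard library has no ML-DSA implementation. A chain of bare hashes also has a known limit, which is that someone who rewrites an entry and recomputes every later hash produces a chain that verifies. Catching that needs signatures or a chain head held outside the operator's control.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for "no previous entry"

def entry_hash(body: dict) -> str:
    """Hash a canonical JSON encoding so equal bodies always hash equally."""
    encoded = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()

def append_commitment(chain, model_digest, config_digest, action):
    """Append an entry that seals the previous one by including its hash."""
    prev = chain[-1]["hash"] if chain else GENESIS
    body = {"index": len(chain), "prev": prev, "model": model_digest,
            "config": config_digest, "action": action}
    entry = dict(body, hash=entry_hash(body))
    chain.append(entry)
    return entry

def verify_chain(chain) -> bool:
    """Recompute every hash and link; needs nothing but the record itself."""
    prev = GENESIS
    for i, entry in enumerate(chain):
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body.get("index") != i or body.get("prev") != prev:
            return False
        if entry_hash(body) != entry.get("hash"):
            return False
        prev = entry["hash"]
    return True

def serve(chain, model_digest, config_digest, action, run):
    """Commit first, execute second, so the record cannot depend on the outcome."""
    append_commitment(chain, model_digest, config_digest, action)
    return run()

chain = []
model_id = hashlib.sha256(b"stand-in weights").hexdigest()
config_id = hashlib.sha256(b"stand-in config").hexdigest()

serve(chain, model_id, config_id, "answer:request-1", lambda: "reply 1")
serve(chain, model_id, config_id, "answer:request-2", lambda: "reply 2")
ok_before = verify_chain(chain)

# Quietly edit history: claim a different model served the first request.
chain[0]["model"] = "f" * 64
ok_after = verify_chain(chain)

print(ok_before, ok_after)
```

`verify_chain` takes only the record as input, with no network call and no operator service, which is the property that makes offline checking possible in principle.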

The part that actually closes the custody gap is the last one: the record is verifiable offline, in an ordinary web browser, with no trust in the vendor. You do not call our application programming interface (API) to check us. You do not rely on our enclave service to vouch for us. You take the record and the public commitments and you verify the chain yourself, on a machine we do not control, including against us. That is the inversion that matters. Integrity you can only confirm by asking the operator is not integrity, it is courtesy. Integrity a hostile third party can confirm without our cooperation is the real thing, and it is the only version that means anything in a courtroom, an audit, or an argument.

We anchor deeper still. The Pantheon chain, a sovereign Layer 1 blockchain that anchors the audit root to Bitcoin, is the one component still in build, and its job is to put the root of that record somewhere no single party, including us, can revise. Its token, PAN, has a fixed supply of five billion. The portfolio that protects all of this is real and on the public record: 104 filed United Kingdom patent applications, 2,340 claims, owned by Mickai LTD, with myself as the named inventor. We are also actively training our own specialised sovereign models now, hardening them on a sealed corpus we control, with funding scaling toward fully native weights. At every stage of that progression the rule does not move: whatever the provenance of a weight, its identity is committed and signed before it serves.

Honest caveats, because the realist owes you them

I would not trust a vendor who told you this solves everything, so let me draw the lines myself. A signed, verifiable record proves which weights ran and that they were not altered after they were sealed. It does not prove the model is good, fair, or safe. A signed backdoor is still a backdoor. The record tells you faithfully that you served a compromised model, which is genuinely useful for forensics and accountability, but it is not the same thing as having stopped the model from being compromised in the first place. Integrity is necessary, not sufficient, and anyone who blurs that distinction is selling you a feeling rather than a guarantee.

There are real costs, and I will name them. Continuous attestation and Merkle re-verification consume compute cycles, and you have to engineer them carefully so they do not throttle inference under load. Key management is its own hard discipline: a signing system is only ever as honest as the custody of its keys, which is exactly why anchoring the root beyond any single operator matters so much, because it removes the one place where a leaked key could otherwise rewrite the past. And none of this defends against an attacker who compromises the model before the signing step, at training time, in the data, upstream of the commitment. What signing before you serve does guarantee is that from the sealed moment onward, the truth about which weights ran is fixed and checkable by anyone. That is a narrower claim than total safety, and it is precisely the one I am willing to stand behind in writing.
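One way to reason about the compute cost is to ask how much re-verification buys how much assurance. The sketch below assumes a simple model that is not from any specification: exactly one of `n_shards` shards has been altered and stays altered, and each round re-hashes `k_per_round` distinct shards chosen uniformly at random, independently of earlier rounds. Under those assumptions a round catches the change with probability k/n, and after r rounds the detection probability is 1 - (1 - k/n)^r.

```python
import math

def detection_probability(n_shards: int, k_per_round: int, rounds: int) -> float:
    """Chance that at least one round samples the single altered shard."""
    if not 0 < k_per_round <= n_shards:
        raise ValueError("k_per_round must be between 1 and n_shards")
    miss_per_round = 1 - k_per_round / n_shards
    return 1 - miss_per_round ** rounds

def rounds_for_confidence(n_shards: int, k_per_round: int, target: float) -> int:
    """Smallest number of rounds whose detection probability reaches target."""
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    if not 0 < k_per_round <= n_shards:
        raise ValueError("k_per_round must be between 1 and n_shards")
    if k_per_round == n_shards:
        return 1  # a full pass always finds it
    miss_per_round = 1 - k_per_round / n_shards
    return math.ceil(math.log(1 - target) / math.log(miss_per_round))

# Illustrative numbers only: 100 shards, 10 re-hashed per round.
one_round = detection_probability(100, 10, 1)
needed = rounds_for_confidence(100, 10, 0.99)

print(round(one_round, 3), needed)
```

With these illustrative numbers, re-hashing a tenth of the model per round needs 44 rounds to reach 99 percent, so the interval between rounds, not just the per-round cost, determines how long a change can go unnoticed.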

The standard to hold every AI system to

Strip away the vendor names, including mine, and a single test remains. Ask of any production AI system: can a party who does not trust the operator prove, after the fact and without the operator's cooperation, exactly which model produced a given answer, and that those weights were not silently altered? If the answer is no, then detecting silent weight tampering has been left to faith, to the hope that the deploy script was honest, the cache was clean, the keys never leaked, and the log was never rewritten by anyone with a reason to rewrite it. Faith is a fine thing in its place. It is not a security control.

That standard is not a feature request, and it is not specific to us. It is the difference between an AI you are asked to trust and an AI you are able to verify. The numbers that decide what your model says deserve the same scepticism we eventually learned to apply to the software supply chain, and then one degree more, because a tampered weight file does not crash, page anyone, or throw an error. It just answers, fluently and wrongly, and waits patiently for the one input that was the point of tampering in the first place. The defence is to make the model's identity something it signs before it speaks, hash-chained, post-quantum, anchored, and checkable offline against the very operator who runs it. Build that, and tampering stops being silent. Skip it, and you are trusting a number you never get to see, held by a party you were never able to check.

Originally published at https://mickai.co.uk/articles/weights-you-signed-versus-weights-that-answer. If you operate in a regulated sector or want sovereign AI on your own hardware, the audit form on mickai.co.uk is the entry point.