Article · 14 June 2026

When Models Eat Their Own Output, Lineage Is the Only Defence

Synthetic data is now training the next generation of models. Without a chain of custody, we are building intelligence on ground we cannot inspect.

Author

Micky Irons

Published

14 June 2026


Tags: synthetic data, data provenance, AI governance, model collapse, post-quantum


The fog rolling into the training set

Here is a fact most people building artificial intelligence (AI) systems would rather not sit with. A growing share of the data used to train new models was itself produced by older models. Synthetic text, synthetic images, synthetic code, synthetic conversations, all generated, scraped back up off the open web, and fed forward into the next run. We are no longer training machines on the world. Increasingly, we are training machines on the residue of earlier machines.

I am not against synthetic data. Used deliberately, it is one of the most useful tools we have. It fills gaps where real data is scarce, expensive, or legally radioactive. It lets you cover edge cases that almost never occur in the wild. The problem is not that synthetic data exists. The problem is that we have stopped being able to tell what is what. Lineage is collapsing. And when lineage collapses, every downstream claim about a model, that it is fair, that it is safe, that it is compliant, rests on ground nobody can actually inspect.

Why lineage collapse is not a theory

Think about how data moves now. A model generates a body of text. That text gets published, indexed, and crawled. A second team harvests it, not knowing or not caring where it came from, and mixes it into a fresh training corpus. A third model trains on the result. By the time anyone asks where a particular pattern entered the system, there is no answer. The trail has gone cold three hops back, and nothing in the pipeline was built to record those hops in the first place.

Researchers have a tidy name for the failure mode at the extreme end of this: model collapse. When models repeatedly train on their own kind of output, the tails of the distribution thin out, rare but real signal gets smoothed away, and quality degrades run over run. Whether you call it model collapse or a self-inflicted distribution shift, the mechanism is the same. The system slowly forgets the parts of reality it saw least often, and it does so without ever throwing an error. There is no crash. There is just a model that is quietly, confidently worse, and a team that cannot prove why because they cannot reconstruct what went into it.
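The thinning of the tails can be seen in a toy setting. The sketch below is an illustration under loud assumptions, not a model of any real training run: the "model" is nothing more than a table of token counts, each generation is sampled from the previous generation's counts, and the corpus, token names, and sizes are all hypothetical. The point it shows is narrow: once a rare token fails to be sampled, its count is zero and it can never return.

```python
import random
from collections import Counter


def next_generation(corpus, rng):
    """'Train' by counting tokens, then sample a new corpus of the same size."""
    counts = Counter(corpus)
    tokens = sorted(counts)                      # sorted so runs are reproducible
    weights = [counts[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=len(corpus))


def simulate(generations=30, seed=0):
    """Return the number of distinct tokens present at each generation."""
    rng = random.Random(seed)
    # Hypothetical starting corpus: one very common token and 100 rare ones.
    corpus = ["common"] * 900 + [f"rare_{i}" for i in range(100)]
    support = [len(set(corpus))]
    for _ in range(generations):
        corpus = next_generation(corpus, rng)
        support.append(len(set(corpus)))
    return support


if __name__ == "__main__":
    # The list can only stay flat or fall: a token with zero count has zero weight.
    print(simulate())
```

Nothing in this loop raises an error as the vocabulary shrinks, which is the uncomfortable part. The only way to notice is to compare against a record of what the earlier generations contained.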

The honest security view here is the one I keep coming back to. You do not get to assume your inputs are clean. You assume they are contaminated until proven otherwise, and you build the machinery to prove it. We have spent two decades learning that lesson about software supply chains. We learned it the hard way with dependencies, with packages pulled blindly from public registries, with the sinking realisation that nobody could enumerate what was actually running in production. Synthetic data is the same problem wearing different clothes. It is a supply chain. We are just refusing to treat it like one.

The poison case, not just the quality case

Quality degradation is the gentle version of the argument. The sharper version is adversarial. If an attacker knows that models are training on web-scraped output, they have a cheap and durable way in. Seed the open web with carefully shaped synthetic content. Wait for it to be harvested. Let it land in someone's training set. This is data poisoning, and it does not require breaking into anything. It requires patience and an understanding that the pipeline trusts whatever it ingests.

You cannot defend against this with a better model. A bigger network trained on poisoned, unverifiable data is just a more capable expression of the poison. The only defence that holds is being able to answer a brutally simple question for any piece of data: where did you come from, who made you, when, and has anyone touched you since. If you can answer that with evidence, you can quarantine, exclude, or down-weight tainted lineages before they ever reach a training run. If you cannot answer it, you are flying blind and calling it scale.
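If lineage is recorded, excluding a tainted lineage becomes a graph walk rather than a guess. The following is a minimal sketch, assuming each artefact carries the identifiers of the artefacts it was derived from. The `Artifact` type, the identifiers, and the rule that an unrecorded parent counts as contaminated are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    artifact_id: str
    parents: tuple = ()   # ids of the artefacts this one was derived from


def tainted_ids(artifacts, bad_sources):
    """Return every id that is a known-bad source or descends from one."""
    known = {a.artifact_id for a in artifacts}
    tainted = set(bad_sources)
    # Assumption: a parent with no record is treated as contaminated.
    for a in artifacts:
        tainted.update(p for p in a.parents if p not in known)
    changed = True
    while changed:                      # propagate taint until nothing new is found
        changed = False
        for a in artifacts:
            if a.artifact_id not in tainted and any(p in tainted for p in a.parents):
                tainted.add(a.artifact_id)
                changed = True
    return tainted


def training_set(artifacts, bad_sources):
    """Ids that are safe to train on, in their original order."""
    bad = tainted_ids(artifacts, bad_sources)
    return [a.artifact_id for a in artifacts if a.artifact_id not in bad]


corpus = [
    Artifact("crawl_a"),
    Artifact("model_x_output"),
    Artifact("mix_1", ("crawl_a", "model_x_output")),
    Artifact("clean_set", ("crawl_a",)),
    Artifact("orphan", ("unrecorded_dump",)),
]
print(training_set(corpus, {"model_x_output"}))   # ['crawl_a', 'clean_set']
```

Down-weighting instead of excluding would replace the final filter with a per-artefact weight. Either way, the decision depends on the `parents` field being present and trustworthy.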

Provenance is a record problem, not a labelling problem

The instinct of the industry has been to reach for watermarks and labels. Mark the synthetic output, the thinking goes, and we will know it when we see it. I want to be careful here, because watermarking is useful and I am glad people are working on it. But a label on a single artefact is not provenance. Provenance is the unbroken history of an artefact, every step it took, recorded in a way that cannot be quietly rewritten after the fact. A watermark tells you a thing might be synthetic. A chain of custody tells you exactly which model produced it, under which configuration, from which inputs, and proves that the record has not been edited since.

That distinction is everything, and it is the same distinction that separates a sticky note from an evidence log. A label can be stripped, forged, or simply ignored downstream. A proper record of custody has to be tamper-evident by construction. The moment provenance becomes something you can edit after the event, it stops being provenance and becomes marketing. If the record can be quietly revised to flatter whoever holds it, then it tells you about that party's preferences, not about the data.

What a real chain of custody has to do

So let me be concrete about the properties that actually matter, because this is where most well-meaning efforts fall short. A chain of custody is not a log file you write up at the end of the week. It is a discipline imposed on the pipeline at the moment work happens, and it has to satisfy four hard conditions before it is worth anything at all.

It must be signed before the fact, not described after it. A record written by the same party that benefits from it, after the work is done, is a story. The signature has to be bound to the action at the moment the action happens.
It must be append-only and hash-chained. Each entry references the one before it, so any later tampering breaks the chain visibly. You cannot delete an inconvenient step without the gap showing.
It must survive the cryptography transition that is already underway. Migration to post-quantum signatures is no longer hypothetical: NIST finalised its first post-quantum signature standards in 2024, and the move is now policy in motion across governments and standards bodies. A provenance record signed with cryptography that a future machine can forge is a liability with a delayed fuse.
It must be verifiable by someone who does not trust you. If checking the record requires calling the vendor's server and believing the answer, you have not solved trust, you have relocated it. The verification has to work offline, in an ordinary browser, with no dependence on the party that produced the data.
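The second condition is the easiest to make concrete. Below is a minimal sketch of an append-only, hash-chained log, assuming SHA-256 and JSON-serialisable payloads. The field names and the `GENESIS` placeholder are hypothetical, and signatures are deliberately left out. Verification uses nothing but the log itself, which is the spirit of the fourth condition.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry (assumption)


def entry_hash(prev_hash, payload):
    """Hash an entry together with the hash of the entry before it."""
    body = json.dumps({"prev": prev_hash, "payload": payload},
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(log, payload):
    """Add an entry that commits to everything recorded before it."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev, "payload": payload,
                "hash": entry_hash(prev, payload)})


def verify(log):
    """Return the index of the first broken entry, or None if the chain is intact."""
    prev = GENESIS
    for i, entry in enumerate(log):
        recomputed = entry_hash(entry["prev"], entry["payload"])
        if entry["prev"] != prev or entry["hash"] != recomputed:
            return i
        prev = entry["hash"]
    return None


log = []
append(log, {"step": "generate", "model": "example-model"})
append(log, {"step": "filter", "kept": 120})
append(log, {"step": "train", "run": 7})
print(verify(log))                      # None: chain intact
print(verify([log[0], log[2]]))         # 1: the deleted step shows as a gap
```

A chain like this makes edits and deletions in the middle visible. On its own it does not show who wrote an entry, and it cannot notice that the most recent entries were dropped, which is why the other conditions still matter.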

Miss any one of those four and the whole thing quietly fails. A record that is hash-chained but verifiable only through the vendor still asks you to trust the vendor. A record that is signed before the fact but with cryptography a future machine can break is a record with an expiry date you cannot see. These properties are not a menu. They are a set, and they hold or fall together.
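The ordering behind "signed before the fact" can also be sketched. The example below is an illustration under a loud assumption: it uses HMAC from the Python standard library as a stand-in for a signature. HMAC is a shared-secret scheme, so a verifier would need the key, which fails the fourth condition; a real system would use a public-key signature, ideally a post-quantum one. The function names are hypothetical.

```python
import hashlib
import hmac
import json


def sign(key, message):
    # Stand-in only: HMAC is a shared-secret MAC, not a public-key signature.
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def run_recorded(log, key, action, params, fn):
    """Record and sign the intent first, then execute the action."""
    intent = json.dumps({"action": action, "params": params}, sort_keys=True)
    log.append({"intent": intent, "sig": sign(key, intent.encode("utf-8"))})
    # The record already exists at this point, whatever fn does next.
    return fn(**params)


def check(log, key):
    """True if every recorded intent still matches its signature."""
    return all(
        hmac.compare_digest(e["sig"], sign(key, e["intent"].encode("utf-8")))
        for e in log
    )
```

Because the entry is appended before `fn` is called, an action that fails or misbehaves still leaves its record behind. The party running it cannot decide afterwards whether the step was worth writing down.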

Why this is becoming non-optional

There is a regulatory tide behind all of this, and it is worth naming plainly. From August 2026, most of the European Union (EU) AI Act's obligations for high-risk systems are scheduled to start applying, and those obligations lean hard on data governance, traceability, and the ability to demonstrate how a system was built. Liability for AI failures is rising across jurisdictions, and the question regulators and courts will ask is not whether you meant well. It is whether you can show your working. A model whose training data has no inspectable lineage is a model you cannot defend, in an audit or in a courtroom.

This is the part that should concentrate the mind of anyone shipping AI into anything that matters. The cost of provenance feels like overhead right up until the day someone asks you to prove what you trained on, and you discover the honest answer is that you have no idea. At that point lineage stops being an engineering nicety and becomes the difference between a system you can stand behind and one you can only apologise for. The teams that build the record early will find the audit boring. The teams that did not will find it expensive.

Where I have put my own bet

I run Mickai, a Sovereign Intelligence Operating System (SIOS), and it is built and live, not a slide. I will not pretend it is a neutral party in this argument, so let me just tell you what we built and why it follows directly from everything above. Inside the system, every AI action is recorded in what we call the Open Audit Record (OAR). The record is signed before the action executes, not narrated afterwards. It is hash-chained and append-only, so the history cannot be quietly rewritten. It is signed with post-quantum cryptography, specifically the ML-DSA-65 parameter set from FIPS 204, the signature standard published by the United States National Institute of Standards and Technology (NIST), so the signatures do not rot when the cryptographic ground shifts. And critically, it verifies offline, in an ordinary browser, with no trust placed in us as the vendor.

That last property is the whole point. We are actively training our own models now, specialising them and building a sealed corpus, and the discipline that protects that corpus is the same discipline I am arguing every serious team needs. For data lineage to mean anything, it must be anchored somewhere nobody can reach in to edit, including us. That is why the audit root is designed to anchor outward to Pantheon, our sovereign Layer 1, which in turn anchors to Bitcoin, rather than living only on infrastructure we control. The record is only as trustworthy as your inability to tamper with it. That is the whole bet, and it is the only honest one I know how to make.
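Why anchoring outward matters can be shown with a generic sketch. This is not a description of how the OAR, Pantheon, or the Bitcoin anchoring actually work, none of which is specified here. It assumes only that a single digest summarising the log has been published somewhere the log's holder cannot edit; the function names and the chaining rule are hypothetical.

```python
import hashlib


def chain_head(entries):
    """Fold every entry into one digest that commits to the whole history."""
    head = b"\x00" * 32
    for entry in entries:
        entry_digest = hashlib.sha256(entry.encode("utf-8")).digest()
        head = hashlib.sha256(head + entry_digest).digest()
    return head.hex()


def matches_anchor(entries, anchored_head):
    """Compare a locally held log against a digest published elsewhere."""
    return chain_head(entries) == anchored_head


history = ["generate:batch-1", "filter:batch-1", "train:run-7"]
anchor = chain_head(history)                  # imagine this value published externally

print(matches_anchor(history, anchor))        # True
print(matches_anchor(history[:-1], anchor))   # False: a dropped final step is caught
```

The second check is the one a purely local hash chain cannot perform. If the holder silently drops the most recent entries, the remaining chain is still internally consistent, and only a digest held out of their reach exposes the truncation.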

The choice in front of us

We are at a fork that does not announce itself. One path is to keep scaling, keep ingesting, and keep telling ourselves that the data is probably fine. That path ends in models built on sediment nobody can core-sample, degrading quietly, poisoned occasionally, and indefensible the moment they are challenged. The other path is duller and harder. It treats synthetic data as a supply chain that must carry its history with it, signed, sealed, and checkable by a stranger.

I know which one sounds like progress. I am telling you the boring one is the only one that survives contact with reality. The fashionable move is to assume the problem will be solved later by someone cleverer, with a bigger model and a smarter filter. It will not. You cannot filter your way out of a corpus whose history you never recorded. Lineage is not paperwork. It is the foundation, and right now most of the industry is building without one.


Originally published at https://mickai.co.uk/articles/provenance-for-synthetic-data. If you operate in a regulated sector or want sovereign AI on your own hardware, the audit form on mickai.co.uk is the entry point.