The Provenance of a Generated Molecule
In AI-driven pharmaceutical research, the candidate compound is the easy part. The defensible record of how it was generated is what survives a regulator and a court.
The molecule that no one can explain
Picture a candidate compound that clears your assays, survives toxicology, and earns a place in a first-in-human trial. It looks like a triumph. Then a regulator, or worse, a court, asks a simple question. How did you arrive at this structure? If the honest answer is that a model proposed it, you have a chemistry problem and a paperwork problem at the same time. The chemistry you can run again. The paperwork, the chain of decisions that turned a statistical suggestion into a dosed human being, is the part that either holds or collapses under examination. I have spent the last few years building infrastructure for exactly this kind of question, and I want to be plain about what I have learned. In drug discovery, the generated molecule is the easy part. The provenance of its generation is the defensible part, and almost nobody is capturing it properly.
This is not an argument against using artificial intelligence (AI) in pharmaceutical research and development (R&D). It is an argument that the value of an AI-generated candidate is bounded by your ability to prove where it came from. A brilliant molecule with no derivable history is a liability dressed as an asset. The drug may be real. The defence of how you found it may be fiction, and you will not know which until someone with subpoena power asks you to show your working.
What a generative model actually does to your evidence chain
Traditional medicinal chemistry leaves a trail by default. A chemist forms a hypothesis, sketches an analogue, runs a synthesis, records a yield, files a notebook entry. Each step is human, dated, and attributable. The reasoning is legible because a person wrote it down, and the artefacts (notebooks, spectra, assay plates) are physical or near-physical. When AI enters the loop, that natural legibility evaporates. A generative model does not explain itself. It samples from a learned distribution and returns a structure, or ten thousand structures, with a score. The reason for any given output is distributed across billions of parameters and the specific state of the system at the moment of inference.
Consider the realistic pipeline. A target is selected. A model trained on protein structures and known binders proposes scaffolds. A second model filters for synthesisability and predicted toxicity. A docking routine ranks poses. A human chemist reviews a shortlist, picks three, and sends them to synthesis. Every one of those stages is a decision, and every decision was shaped by a model version, a training corpus, a random seed, a temperature setting, a filtering threshold, and a prompt or configuration that a person chose. None of that is captured by saving the final structure-data file. You have the answer. You have lost the question and the working.
The gap matters because the interesting failures in AI-assisted discovery are not chemistry failures. They are provenance failures. The model was trained on a corpus that included a poisoned or proprietary set of compounds. The filtering threshold was loosened on a Friday and never reverted. The candidate that reached trial was actually generated by a different model version than the one named in the method section. Each of these is invisible if you only kept the molecule, and each of them is the kind of detail a determined opponent will go looking for precisely because it is the kind of detail you did not keep.
The two readers who decide everything: the regulator and the court
There are two readers you are really writing your record for, and they read differently. The first is a regulator. A medicines authority does not need to love your AI. It needs to be satisfied that your process was controlled, that your data has integrity, and that you can reconstruct how a decision was made. The long-standing principles for data integrity in regulated manufacturing and laboratory work (often summarised as ALCOA: attributable, legible, contemporaneous, original, and accurate, with later extensions covering completeness, consistency, endurance, and availability) were written for paper and instruments. They apply with more force, not less, when a model is in the loop. A regulator wants to know that the record was made at the time, by an identified actor, and cannot have been quietly edited afterwards.
The second reader is a court, and the court is more adversarial. In product liability litigation, the question is not whether your process was reasonable in general. It is whether you can prove, against an opponent who is paid to disbelieve you, that this specific molecule was derived the way you say it was. Discovery will pull your model versions, your configuration history, your change logs. If your provenance is a spreadsheet that any employee could have edited last week, opposing counsel will say so, and a jury will hear it. The asymmetry is brutal. You have to prove derivation. They only have to introduce doubt about your records. A record that depends on trusting your own internal systems is a record built to lose.
These two readers converge on the same demand. They both want an account of the generation that was fixed at the moment it happened, that names who or what acted, and that cannot have been reshaped to fit the story you now wish to tell. That is provenance. Not the molecule. The history of the molecule. Serve that single account well and you have satisfied both readers at once, because the regulator's idea of integrity and the court's idea of authenticity are, at bottom, the same idea wearing different robes.
Reproducibility is not provenance, and conflating them is dangerous
A common response is that you can just re-run the pipeline. Reproducibility is necessary, but it is not sufficient. Re-running tells you that a model, given inputs, can produce a similar output today. It does not tell you that this candidate was produced that way on the day it was produced. Generative systems are frequently stochastic. Sampling temperature, hardware non-determinism in floating-point operations, library versions, and updated weights all mean that today's run is a cousin of the original, not a twin. If you cannot pin the exact model version, the exact seed, the exact configuration, and the exact inputs as they stood at the time, your reproduction is a re-enactment performed by an understudy.
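To make "pinning" concrete, here is a minimal sketch in Python of a run manifest that fixes the weights, seed, configuration, and inputs under a single digest. The function and field names (`hash_bytes`, `build_manifest`, `weights_sha256`, and so on) are illustrative assumptions, not the schema of any system described in this article.

```python
import hashlib
import json


def hash_bytes(data: bytes) -> str:
    """Return the SHA-256 hex digest of raw bytes (for example a weights file)."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(weights: bytes, seed: int, config: dict, inputs: dict) -> dict:
    """Pin what ran, as it stood at the time, under one digest."""
    manifest = {
        "weights_sha256": hash_bytes(weights),  # content hash, not a friendly name
        "seed": seed,
        "config": config,
        "inputs": inputs,
    }
    # Canonical JSON (sorted keys, fixed separators) so identical facts
    # always serialise to identical bytes, and therefore the same digest.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    manifest["manifest_sha256"] = hash_bytes(canonical.encode("utf-8"))
    return manifest
```

Two runs with the same pinned facts yield the same `manifest_sha256`; changing the seed alone changes it. The digest only says what was pinned. It does not, on its own, prove when the manifest was written or by whom.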
There is a deeper trap. Models get retrained. Corpora get updated. The system that generated last year's candidate may no longer exist in the same form, because you improved it. That is good engineering and terrible evidence hygiene if you did not record the prior state. Provenance is the discipline of capturing the world as it was at the moment of generation, so that you are not asking a changed system to vouch for a decision it no longer remembers making. Reproducibility asks whether you can do it again. Provenance asks whether you can prove what you did. A court cares about the second, and the gap between the two is where most discovery programmes are quietly exposed.
The supply chain you forgot you depend on: training data and model lineage
A generated molecule inherits the biases and the contamination of everything upstream of it. If your scaffold generator was trained partly on a competitor's published compounds, or on data you did not have the rights to use, the candidate carries that origin whether you logged it or not. Intellectual property disputes in this field will increasingly turn on training-corpus provenance, because the question of whether a molecule is meaningfully derived from data you were not entitled to learn from is a question about the model's lineage, not about the structure in isolation. The structure is innocent. The pedigree may not be.
This is where most AI provenance efforts are thin. People log the inference call and ignore the training history. But the defensible record has to reach back. Which model version, trained on which corpus snapshot, fine-tuned with which curated set, under which data-use terms. When a model is updated, the lineage has to be recorded as a chain, so that any candidate can be traced not just to a model but to a precise ancestor of that model. We approached our own systems with the assumption that the model's training is itself an auditable event. We are actively training our own models now, fine-tuning and specialising open foundations (Llama 3.2 and Qwen 2.5) and building a sealed corpus, and our funding is aimed at scaling that work toward fully native weights. We treat each training and specialisation step as something that must be recorded with the same rigour as an inference, because tomorrow's litigation will ask about both.
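Lineage recorded as a chain can be sketched as parent links between model versions. The example below is a sketch under stated assumptions: the `ModelVersion` fields and the in-memory `registry` dictionary are hypothetical, chosen only to show how a candidate's model can be walked back to its root ancestor and how an unrecorded ancestor surfaces as an error rather than a silent gap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelVersion:
    """One link in a model's lineage (all field names are illustrative)."""
    version_id: str
    corpus_snapshot: str               # identifier of the training-data snapshot
    data_use_terms: str                # terms the data was used under
    parent_id: Optional[str] = None    # version this one was derived from


def trace_lineage(version_id: str, registry: dict) -> list:
    """Walk parent links from a model version back to its root ancestor."""
    chain = []
    seen = set()
    current = version_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in lineage at {current}")
        if current not in registry:
            # A missing ancestor is a provenance gap, so fail loudly.
            raise KeyError(f"unrecorded ancestor: {current}")
        seen.add(current)
        node = registry[current]
        chain.append(node)
        current = node.parent_id
    return chain
```

Given a registry holding a base model and one fine-tune of it, `trace_lineage` returns the fine-tune followed by the base. Asking for a version that was never registered raises `KeyError`.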
What a defensible generation record has to contain
Strip away the marketing and a defensible provenance record for a generated candidate needs a specific set of facts, captured at the moment of generation and bound together so they cannot be separated or silently altered. From the work I have done, the irreducible list looks like this.
- The exact model identity and version, including a content hash of the weights, not just a friendly name.
- The training and fine-tuning lineage of that model, traceable to specific corpus snapshots and data-use terms.
- The complete inputs: the target, the constraints, the prompt or configuration, and the random seed.
- The inference parameters: temperature, sampling method, filtering thresholds, ranking criteria.
- The output as generated, before any human curation, with the curation recorded as a separate, attributed step.
- The identity of every actor, human and machine, who touched the decision, and the time each one acted.
- An ordering that proves what came before what, so the sequence of decisions cannot be rearranged after the fact.
Notice that the last two items are not about chemistry at all. They are about who acted and in what order, and they are exactly the items that ordinary logging handles worst. A log file records events but rarely binds them in a way that survives a hostile audit. Anyone with write access can append, edit, or quietly reorder. That is the weakness a good opponent will attack first, and it is the weakness that almost every pipeline I have examined still carries without realising it.
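The list above can be sketched as a data structure. What follows is one possible shape, not a standard: the class and field names are my own illustrative assumptions. It keeps the raw output separate from curation, which appears only as an attributed step, and it includes a small check that names any missing fact and flags a broken ordering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One attributed action in the decision sequence."""
    sequence: int    # position in the ordering: 0, 1, 2, ...
    actor: str       # human or machine identity
    action: str      # e.g. "generate", "filter", "curate"
    acted_at: str    # timestamp recorded when the step ran


@dataclass(frozen=True)
class GenerationRecord:
    weights_sha256: str     # content hash of the weights
    lineage: tuple          # ancestor model versions, newest first
    inputs: dict            # target, constraints, prompt or config, seed
    inference_params: dict  # temperature, sampling, thresholds, ranking
    raw_output: tuple       # structures exactly as generated, pre-curation
    steps: tuple = ()       # every actor and when each one acted


def missing_facts(record: GenerationRecord) -> list:
    """Name any required fact that is absent or empty."""
    required = ("weights_sha256", "lineage", "inputs",
                "inference_params", "raw_output", "steps")
    missing = [name for name in required if not getattr(record, name)]
    # Steps must be numbered 0, 1, 2, ... with no gaps or repeats.
    if [s.sequence for s in record.steps] != list(range(len(record.steps))):
        missing.append("ordering")
    return missing
```

A schema like this only makes omissions visible. It does nothing to stop someone with write access from editing the record afterwards.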
Why ordinary logging fails the adversarial test
Most organisations believe they already have provenance because they have logs. They have a database that records pipeline runs, a version control system for code, an electronic notebook for chemists. These are useful and they share one fatal property under adversarial scrutiny. They are trust-me records. They prove your process to anyone who already trusts your infrastructure and your staff. They prove nothing to a regulator or a court that does not. A database row can be updated. A timestamp can be backdated by an administrator. A log can be regenerated. The defence that your internal systems say so invites the obvious reply, which is that your internal systems are controlled by you.
The fix is not more logging. It is a different kind of record. The record has to be created before the action it describes, so it cannot be a post-hoc reconstruction. It has to be cryptographically signed by the actor, so attribution is mathematical, not administrative. It has to be hash-chained and append-only, so that altering any earlier entry breaks every entry after it and the break is detectable by anyone. And critically, it has to be verifiable by a third party who trusts none of your systems, using nothing more than ordinary tools. If verification requires calling your servers or trusting your software, you have not escaped the trust-me problem. You have just moved it one layer down and hoped no one would look.
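The hash-chaining property is easy to show in a few lines. This is a minimal sketch using only the Python standard library, with hypothetical names (`GENESIS`, `entry_hash`, `append`, `first_broken_index`). It deliberately leaves out signing, and is not the format of any particular product.

```python
import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash an entry together with the hash of the entry before it."""
    body = json.dumps({"prev": prev_hash, "payload": payload},
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(chain: list, payload: dict) -> None:
    """Add an entry whose hash commits to the whole history so far."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev": prev, "payload": payload,
                  "hash": entry_hash(prev, payload)})


def first_broken_index(chain: list) -> Optional[int]:
    """Recompute every link; return the index of the first bad entry, or None."""
    prev = GENESIS
    for i, entry in enumerate(chain):
        recomputed = entry_hash(prev, entry["payload"])
        if entry["prev"] != prev or entry["hash"] != recomputed:
            return i
        prev = entry["hash"]
    return None
```

Editing the payload of an earlier entry makes `first_broken_index` return that entry's position, and the check needs nothing beyond the chain itself. A bare chain can still be recomputed end to end by whoever controls it, which is why signatures and an external anchor are needed as well.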
Signed before it executes, verifiable offline, anchored beyond your reach
This is the heart of what we built into Mickai, our Sovereign Intelligence Operating System (SIOS), and I will describe it as a design principle rather than a sales line, because the principle is what matters for drug discovery. Every AI action in the system is written into an Open Audit Record (OAR). The defining property is the order of operations. The record of an action is signed before the action executes, not logged after. That single inversion is what defeats the most damaging objection in litigation, the claim that the record was assembled later to fit the result. You cannot retrofit a candidate's history if the history was sealed before the candidate existed. The signature is not a description of the past. It is a commitment made in the present that the future cannot quietly revise.
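The order of operations can be illustrated with a short sketch. This is not Mickai's implementation or the OAR format. The names `run_with_commitment` and `demo_signer` are hypothetical, and the demo signer uses HMAC-SHA256 purely as a standard-library stand-in: it is a symmetric MAC, not a public-key signature and not ML-DSA. The only point shown is that the record is sealed and appended before the action runs.

```python
import hashlib
import hmac
import json
from typing import Callable


def run_with_commitment(intent: dict, action: Callable[[], object],
                        sign: Callable[[bytes], str], log: list) -> object:
    """Sign and append the record of an action BEFORE executing it."""
    message = json.dumps(intent, sort_keys=True,
                         separators=(",", ":")).encode("utf-8")
    log.append({"intent": intent, "signature": sign(message)})  # sealed first
    return action()                                             # executed second


def demo_signer(key: bytes) -> Callable[[bytes], str]:
    """Return a signing function for demonstration only."""
    # ASSUMPTION: HMAC-SHA256 stands in for a real signature scheme here.
    return lambda message: hmac.new(key, message, hashlib.sha256).hexdigest()
```

Because the append happens first, an action that fails or is abandoned still leaves its sealed intent in the log. In a real deployment the `sign` callable would be backed by an actual signature scheme and key management, neither of which this sketch attempts.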
The record is hash-chained and append-only, so the sequence of decisions (target selection, generation, filtering, human curation) is locked into an order that cannot be rearranged. The signatures use post-quantum cryptography, specifically the United States National Institute of Standards and Technology (NIST) standard for module-lattice digital signatures (FIPS 204, ML-DSA-65), because a record meant to hold for the decade-plus lifetime of a drug programme has to survive the arrival of quantum computers that would break today's ordinary signatures. A provenance record that becomes forgeable in eight years is not a provenance record. It is a time bomb with a politely delayed fuse.
Two further properties close the loop. The record is verifiable offline, in an ordinary web browser, with no trust in the vendor and no call home. A regulator, an opposing expert, or your own future self can take the record and the public verification logic and confirm the chain independently, on an air-gapped machine if they wish. And the audit root is anchored externally. In our case it is anchored to an independent Layer 1 we call Pantheon, which itself anchors to Bitcoin, so the existence and ordering of your records is witnessed by a system entirely outside your control. The point of external anchoring is simple. It removes the last trust-me. Even if every server you own were compromised, the anchored root would expose any attempt to rewrite history.
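I have not said how an audit root is computed, so the sketch below rests on an assumption: one common construction is a Merkle root, where many record digests fold into a single value that can be published somewhere outside your control. The function names and the domain-separation bytes are illustrative choices, not a description of Pantheon or of any anchoring protocol.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    """Return the raw SHA-256 digest of data."""
    return hashlib.sha256(data).digest()


def merkle_root(leaves: list) -> str:
    """Fold a list of record digests (bytes) into one root digest."""
    if not leaves:
        raise ValueError("no records to anchor")
    # Domain-separate leaves (0x00) from interior nodes (0x01).
    level = [sha256(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(sha256(b"\x01" + level[i] + level[i + 1]))
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # an odd node is carried up unchanged
        level = nxt
    return level[0].hex()
```

Changing or reordering any record changes the root. Once a root has been published externally, a rewritten history no longer matches what the outside witness holds.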
Honest caveats, because provenance is not a magic wand
I am a security realist, so let me be clear about what this does and does not do. A signed, offline-verifiable record proves derivation. It does not prove that the molecule is safe, effective, or non-infringing. Those are separate questions answered by trials, by chemistry, and by patent analysis. Provenance proves the process, not the merit. Anyone who tells you a cryptographic audit trail makes a drug safe is selling you something, and it is not honesty.
There is also a garbage-in problem. If you sign a record that faithfully captures a flawed process, you have a verifiable account of a flawed process. The record makes your process legible and tamper-evident. It does not make a bad decision good. That is a feature, not a bug. The point is to make the truth durable, and sometimes the durable truth is that a threshold was set carelessly. Better to know, and better that the record shows you knew when. And provenance cannot reach data you never captured. If the training corpus was assembled without recording its sources, no downstream signing fixes that gap. The discipline has to start at the corpus, not at the inference call. These limits are real, and pretending otherwise would undermine the very credibility the record is meant to create.
The regulatory weather is changing, and it favours the prepared
The direction of travel is unmistakable. High-risk AI obligations under the European Union (EU) Artificial Intelligence Act begin to bite from August 2026, with requirements around record-keeping, traceability, and human oversight that read almost as if written for this exact problem. AI liability is rising across jurisdictions, and the burden is shifting toward those who deploy the systems to demonstrate that they were controlled. Post-quantum migration is moving from advisory to expected, and standards bodies have published the algorithms organisations are meant to adopt. None of these trends individually mandates the architecture I have described. Together they describe a world where the ability to produce a fixed, attributed, tamper-evident, future-proof record of an AI decision moves from competitive advantage to baseline expectation.
Pharmaceutical companies operate on decade-long horizons. A candidate generated now may face its decisive legal and regulatory scrutiny in the 2030s, under rules and cryptographic expectations stricter than today's. The cheapest time to build provenance is at the moment of generation. The most expensive time is during discovery in a courtroom, reconstructing a history from systems that were never designed to defend themselves. You do not get to go back and sign a record after the fact. That is the whole point, and it is the one part of this argument that no future tooling will ever soften.
The defensible asset
So here is where I land. The generated molecule is a hypothesis. Its scientific value is decided by chemistry and clinical evidence, as it always has been. But its institutional value, whether it is an asset you can defend or a liability you have to explain, is decided by provenance. A candidate with a signed, hash-chained, post-quantum, offline-verifiable record of its derivation is an asset that can answer the regulator and survive the court. A candidate with a folder of editable logs is a bet that no one will ever ask the hard question, and in this field, someone always asks. The infrastructure to make that bet unnecessary already exists and is in production. I built it because I would not want to be on the wrong side of that question myself.
My thesis is narrow and I will not dress it up. In AI-driven drug discovery, you are not really protecting molecules. You are protecting the record of how they came to be. Build that record so that it was sealed before the action, names every actor, cannot be reordered, will not be forged by tomorrow's computers, and can be checked by someone who trusts you not at all. Do that, and the question of how you derived this candidate stops being a threat and becomes an answer you can hand over. That is the defensible record, and in a field where lives and verdicts ride on the chain of reasoning, it is the difference between an innovation and an exposure.


