Article · 3 May 2026

Multimodal AI without provenance is a deepfake factory. The 2026 fix is per-frame signing, voice gating, and a consent envelope around every output.

GPT-5.5, Gemini, and the Meta multimodal stack can now emit a video clip indistinguishable from real footage. None of them ship a per-frame cryptographic signature, a survivable watermark, a voice-biometric gate, or a consent envelope. This article sets out what real multimodal provenance looks like, by reference to filed patents (GB2608825.2, GB2608826.0, GB2608824.5, GB2608799.9, GB2608827.8, GB2608830.2), and what regulators are about to require.

Author: Micky Irons
Published: 3 May 2026
Tags: multimodal-ai · provenance · deepfakes · sovereign-ai · mickai

The clip you cannot prove

In the first quarter of 2026, the major frontier labs all shipped a multimodal generation tier that crosses a threshold worth naming. GPT-5.5 produces a video clip of a named person, in a specific room, saying a specific sentence, and the clip is indistinguishable from real footage on the metrics that matter (lip sync, micro-expression, gaze tracking, ambient acoustic match, lighting consistency). Gemini's equivalent tier reaches the same threshold by a different route. Meta's open-weights multimodal stack reaches it cheaper. None of these systems ship a per-frame cryptographic signature bound to the natural person depicted, none of them ship a watermark that survives a single re-encode, and none of them ship a consent envelope that lets a downstream verifier prove the depicted person agreed to the clip's existence.

That is not a missing feature. That is a missing architecture. The labs have spent eighteen months arguing about model alignment and capability evaluations while the structural problem has been sitting in plain sight: a multimodal model that can synthesise a person without a chain back to that person is, by construction, a deepfake factory. Every clip it emits is a forgery in waiting. The only question is whether the forgery is detected before it does damage, and the labs have ceded that question to journalists, fact-checkers, and the open-source detector community. None of those communities are going to win.

The provenance gap in the big-lab approach

The big-lab response to deepfake risk has converged on three weak measures. First, a visible watermark or a metadata tag that any re-encode strips in milliseconds. Second, a content-credentials field in the file header that any re-mux removes without trace. Third, a model-side refusal layer that declines to generate certain named individuals, easily defeated by anyone willing to fine-tune a smaller model or to describe the target without naming them. None of these measures binds the output cryptographically to a natural person who consented. None of them survives the adversarial pipeline that any motivated forger will run the clip through within seconds of its release. None of them gives a verifier on the receiving end a mathematical answer to the question: did the depicted person agree to this?

The structural gap is the absence of three properties that any sovereign multimodal system has to satisfy by construction. First, every emitted frame, every emitted audio segment, every emitted text fragment must carry a cryptographic signature bound to the natural person whose likeness or voice is depicted. Second, the signature must survive the adversarial transformations a real adversary applies (re-encode, crop, resample, format conversion, partial overlay). Third, every output must travel inside a consent envelope that records, under the depicted person's hardware-attested key, the scope of the agreement to that specific output. The big labs have shipped none of the three. The Mickai filing portfolio is composed, by construction, around all three.
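To make the three-property requirement concrete, here is a minimal sketch of a receiving-side verifier that composes the three checks into one structured verdict. The check callables are stand-ins supplied by the caller; nothing below is an API from the filings themselves.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ProvenanceResult:
    frames_signed: bool     # every frame verifies under the depicted person's key
    watermark_intact: bool  # the audio watermark survived whatever the clip went through
    consent_valid: bool     # the consent envelope signature and scope both check out

    @property
    def authorised(self) -> bool:
        # All three properties must hold; any single failure rejects the clip.
        return self.frames_signed and self.watermark_intact and self.consent_valid

def check_clip(clip: object,
               frame_check: Callable[[object], bool],
               watermark_check: Callable[[object], bool],
               consent_check: Callable[[object], bool]) -> ProvenanceResult:
    """Compose the three structural checks into one structured verdict."""
    return ProvenanceResult(frames_signed=frame_check(clip),
                            watermark_intact=watermark_check(clip),
                            consent_valid=consent_check(clip))

# A clip with no signatures, no watermark, and no envelope fails closed.
verdict = check_clip(clip=None,
                     frame_check=lambda c: False,
                     watermark_check=lambda c: False,
                     consent_check=lambda c: False)
assert not verdict.authorised
```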

What real multimodal provenance looks like

Mickai's approach to multimodal provenance is structural, not policy-themed. Each of the following filed patents addresses one of the structural properties the big-lab approach is missing, and the composition of the six is what closes the deepfake gap.

GB2608825.2 (MWI-PA-2026-009), Attestable Avatar Rendering with Per-Frame Cryptographic Signing Bound to a Natural-Person Identity, addresses the per-frame signature requirement directly. Every rendered frame of an avatar carries a signature under a key whose private half is held in hardware controlled by the natural person depicted. A verifier looking at a single frame, in isolation, can establish whether that frame was rendered with the depicted person's hardware-attested authorisation. A frame produced by a different system, by a different rendering pipeline, by anyone other than the authorised pipeline, fails verification by construction.
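As an illustration of the single-frame property, the sketch below signs each frame under a keypair standing in for the subject's hardware-held key and verifies one frame in isolation. The filing does not name a signature scheme; Ed25519 via the Python cryptography package is an assumption made here for concreteness.

```python
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

private_key = Ed25519PrivateKey.generate()  # stands in for the subject's hardware-held key
public_key = private_key.public_key()       # published for verifiers

def sign_frame(frame_bytes: bytes, frame_index: int) -> bytes:
    # Bind the signature to both the pixel data and the frame's position,
    # so frames cannot be reordered or substituted individually.
    return private_key.sign(frame_index.to_bytes(8, "big") + frame_bytes)

def verify_frame(frame_bytes: bytes, frame_index: int, signature: bytes) -> bool:
    try:
        public_key.verify(signature, frame_index.to_bytes(8, "big") + frame_bytes)
        return True
    except InvalidSignature:
        return False

sig = sign_frame(b"\x00" * 1024, frame_index=0)
assert verify_frame(b"\x00" * 1024, 0, sig)      # frame from the authorised pipeline
assert not verify_frame(b"\xff" * 1024, 0, sig)  # any other frame fails by construction
```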

GB2608826.0 (MWI-PA-2026-011), Dual-Layer Audio Watermark for Sovereign AI Outputs with Voice-Gated Production, closes the survivability gap. Audio outputs carry a dual-layer watermark designed to survive the adversarial transformations a forger actually runs (re-encode through lossy codecs, resample, format conversion, partial overlay with other audio). The production gate requires a voice-biometric sample from the speaker whose voice is being synthesised, recorded under that speaker's hardware-attested key, before the output is emitted. A synthetic voice clip without the gate is not produced by the authorised pipeline by construction.
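The watermark internals are beyond a short sketch, but the production gate can be illustrated. Below, synthesis is permitted only when a live voice embedding matches the enrolled speaker's template; the embedding space, the cosine metric, and the threshold are all assumptions made here, not details from the filing.

```python
import numpy as np

MATCH_THRESHOLD = 0.85  # illustrative; a real deployment would calibrate this

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def voice_gate(live_embedding: np.ndarray, enrolled_embedding: np.ndarray) -> bool:
    """Permit synthesis only if the live sample matches the enrolled speaker."""
    return cosine_similarity(live_embedding, enrolled_embedding) >= MATCH_THRESHOLD

rng = np.random.default_rng(0)
enrolled = rng.standard_normal(256)
assert voice_gate(enrolled + 0.05 * rng.standard_normal(256), enrolled)  # same speaker
assert not voice_gate(rng.standard_normal(256), enrolled)                # different speaker
```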

GB2608824.5 (MWI-PA-2026-007), ChatClone Anti-Deepfake System with Consent-Bound Provenance Chain, addresses the consent envelope. Every output of the ChatClone pipeline travels with a consent envelope that records the scope of the depicted person's agreement to that specific output, signed under their hardware-attested key, with the consent record itself written to an append-only log. A verifier can extract the envelope, verify the signature, check the scope against the actual content, and answer the question: did this person agree to this?
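A minimal sketch of what envelope verification could look like, assuming the envelope binds a SHA-256 hash of the output and a scope string under the subject's key. Field names and layout are illustrative, and the append-only log is omitted; the filing defines the actual format.

```python
import hashlib, json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

subject_key = Ed25519PrivateKey.generate()  # the depicted person's hardware-attested key

def make_envelope(content: bytes, scope: str) -> dict:
    payload = json.dumps({"content_sha256": hashlib.sha256(content).hexdigest(),
                          "scope": scope}, sort_keys=True).encode()
    return {"payload": payload, "signature": subject_key.sign(payload)}

def verify_envelope(envelope: dict, content: bytes) -> bool:
    try:
        subject_key.public_key().verify(envelope["signature"], envelope["payload"])
    except InvalidSignature:
        return False  # not signed by the depicted person
    claimed = json.loads(envelope["payload"])["content_sha256"]
    return claimed == hashlib.sha256(content).hexdigest()  # consent covers THIS output

clip = b"rendered clip bytes"
env = make_envelope(clip, scope="campaign video, 2026-05, single platform")
assert verify_envelope(env, clip)
assert not verify_envelope(env, b"a different clip")  # consent does not transfer
```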

GB2608799.9 (MWI-PA-2026-013), Voice-Biometric-Gated Deterministic LLM Tool Invocation, applies the voice gate to the audio modality of LLM tool use. Where the model's tool invocation is triggered by a voice utterance, the gate verifies the utterance against the authorised speaker's biometric template, recorded under their hardware-attested key, before the tool call is permitted. The audit record carries the voice-biometric verification result alongside the tool invocation, which means a verifier can establish, for any tool call attributed to a voice, whether the voice matched the authorised speaker.
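To illustrate the audit pairing, the sketch below refuses the tool call when the gate fails and writes the biometric outcome into the record either way, so every voice-attributed invocation carries a verifiable match result. The record fields and the verify_speaker callable are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable

@dataclass
class ToolAuditRecord:
    tool_name: str
    speaker_id: str
    voice_match: bool  # biometric verification result, logged alongside the call
    invoked: bool
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

def gated_invoke(tool: Callable[..., Any], tool_name: str, speaker_id: str,
                 utterance: bytes, verify_speaker: Callable[[str, bytes], bool],
                 **kwargs) -> ToolAuditRecord:
    matched = verify_speaker(speaker_id, utterance)
    if matched:
        tool(**kwargs)  # the tool call only fires behind a passing gate
    return ToolAuditRecord(tool_name, speaker_id, voice_match=matched, invoked=matched)

record = gated_invoke(tool=lambda **kw: None, tool_name="send_payment",
                      speaker_id="alice", utterance=b"...",
                      verify_speaker=lambda sid, utt: False)
assert not record.invoked and not record.voice_match  # denied, and the denial is logged
```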

GB2608827.8 (MWI-PA-2026-006), Sovereign Voice-Biometric Identity System Adapted for Extreme Environments, generalises the voice-biometric layer to operating environments where the standard consumer-device assumptions fail (high-noise industrial floors, marine environments, defence deployments, medical environments with masking acoustic signatures). The point of including this in a multimodal-provenance discussion is that voice biometrics that work only in a quiet office are not a sovereign primitive. The extreme-environment adaptation is what makes the voice-gating layer durable across the actual deployment surface.

GB2608830.2 (MWI-PA-2026-002), Multi-Brain Cooperative Intelligence Architecture, sits at the cross-modality reasoning layer. A multimodal output is rarely produced by one model in one pass; it is produced by a composition of specialised models reasoning across modalities (a visual model, an audio model, a language model, sometimes a code model). The multi-brain architecture treats each as a separately attested actor with its own signed reasoning trace, which means the cross-modality composition is itself auditable. A verifier can establish, for any multimodal output, which constituent models contributed which fragments, under whose authorisation, and with what consent envelope.
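The auditable composition can be sketched as a signed manifest: each constituent model signs the hash of its own fragment, so a verifier can attribute every fragment to an attested actor. Per-model keys and the manifest shape are assumptions for illustration, not the filing's mechanism.

```python
import hashlib
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# One key per constituent brain; in practice each would be separately attested.
brains = {name: Ed25519PrivateKey.generate() for name in ("visual", "audio", "language")}

def contribute(model: str, fragment: bytes) -> dict:
    digest = hashlib.sha256(fragment).digest()
    return {"model": model, "fragment_sha256": digest,
            "signature": brains[model].sign(digest)}

def verify_manifest(manifest: list[dict]) -> bool:
    for entry in manifest:
        try:
            brains[entry["model"]].public_key().verify(
                entry["signature"], entry["fragment_sha256"])
        except InvalidSignature:
            return False
    return True

manifest = [contribute("visual", b"frames"), contribute("audio", b"waveform"),
            contribute("language", b"script")]
assert verify_manifest(manifest)  # every fragment traceable to its signing brain
```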

A worked example

Consider a deepfake video posted to a major social platform on a Monday morning. The clip shows a named UK politician saying something they did not say. It is shared two million times before lunch. By Tuesday it is on the front page of three newspapers. By Wednesday the politician has issued a denial, which fewer people see than the original clip. By Friday the fact-checker community has produced a forensic analysis, which still fewer people will read. The damage is done before the verification cycle completes, and the verification cycle does not produce a mathematical answer; it produces a probabilistic judgement that a determined audience will reject.

Now consider the same clip arriving in a world where the Mickai provenance chain is the default. A verifier on the receiving platform extracts the per-frame signatures (009) and finds them absent or invalid. The verifier extracts the audio watermark (011) and finds it absent or stripped. The verifier looks for the consent envelope (007) and finds none signed under the depicted politician's hardware-attested key. The verifier looks for the voice-biometric gate record (013) and finds no authorised utterance attached to the tool invocation that produced the clip. The verifier returns a structured result: this clip is not from any authorised pipeline; the depicted person did not consent to its existence; the audio is not from the depicted person's voice. The result is mathematical, not probabilistic, and it is available within the latency budget of the share button. The clip is rejected at the edge.
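For concreteness, the structured result in this scenario might look like the following. The field names are illustrative, but each entry is a discrete mechanical check rather than an analyst's judgement, and the verdict composes from them.

```python
# Illustrative shape of the edge verifier's structured result for this clip.
edge_verdict = {
    "frame_signatures": "absent",   # GB2608825.2 check (009)
    "audio_watermark": "stripped",  # GB2608826.0 check (011)
    "consent_envelope": "none",     # GB2608824.5 check (007)
    "voice_gate_record": "none",    # GB2608799.9 check (013)
    "authorised_pipeline": False,
    "action": "reject_at_edge",
}
```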

That is the structural difference. The big-lab approach hopes the detector community keeps up. The Mickai approach makes the verifier's answer a property of the system, not a property of the analyst.

Why the regulators will follow

The EU AI Act's general-purpose AI provisions, the UK AI Bill now being drafted in committee, the US executive orders on synthetic content, and the corresponding work in the G7 Hiroshima process are all converging on the same requirement: provenance is mandatory, watermarks are mandatory, consent records are mandatory, and the burden of proof is moving from the receiving public to the emitting system. The standards bodies (C2PA, ISO/IEC AI provenance work, IETF SCITT) are converging on the same primitives. The question is no longer whether the multimodal provenance layer becomes a regulatory requirement; the question is whether the labs ship the layer themselves or whether the layer is imposed on them by statute. The Mickai filing portfolio is built on the assumption that the layer is imposed.

Micky Irons has filed the six patents above, in person, as the sole inventor and sole applicant, without a patent attorney and without a law firm. The filings cover the structural properties any compliant multimodal provenance layer has to satisfy. The portfolio is open to collaboration with regulators, with platform operators, with sovereign-AI procurement teams, and with anyone building inside the regulatory perimeter that is now visibly arriving. Mickai's position is not adversarial to the labs; it is structurally upstream of them. When the regulatory layer lands, the structural primitives have to exist somewhere, and they have to exist in a form that survives adversarial transformation. They exist in this portfolio.

Call to action

If you are building a multimodal pipeline, deploying one inside a regulated environment, writing the procurement spec for one, or drafting the regulatory text that will govern one, Micky Irons is open to collaboration. Two specific surfaces are worth naming.

The /oar surface (Open Audit Record) is the public-facing audit-record format Mickai publishes for any output produced under its provenance chain. It is the format a third-party verifier reads to extract the consent envelope, the per-frame signatures, the watermark verification, and the voice-biometric gate record.
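The /oar schema itself is not published in this article, so the record shape below is a hypothetical reconstruction of what a third-party verifier would read; every field name is an assumption.

```python
# Hypothetical /oar record shape, expressed as a Python dict for illustration.
oar_record = {
    "output_id": "<opaque id>",
    "consent_envelope": {"payload": "<base64>", "signature": "<base64>"},
    "frame_signatures": [{"index": 0, "signature": "<base64>"}],
    "watermark": {"layers": 2, "verified": True},
    "voice_gate": {"speaker_id": "<subject>", "match": True},
    "log_entry": {"append_only_index": 0, "chain_hash": "<hex>"},
}
```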

The /verify surface (browser-resident offline post-quantum verifier) is the upcoming consumer-facing verifier that runs entirely in the browser, with no network round trip, against the post-quantum signature scheme used across the portfolio. It is referenced in the upcoming Family-023 patent (GB number to be assigned on filing). The intent is that any reader, on any device, can verify any clip, with no dependence on the platform that hosted it and no dependence on a centralised verification service.

The structural fix to the deepfake problem is not a detector arms race. It is a provenance layer that the labs have not built and the regulators are about to require. The portfolio is filed. The architecture is composed. The work continues from Workington, in person, without a law firm, on the assumption that the structural answer wins because the alternative is the deepfake factory you can already see operating.

Micky Irons, sole inventor and sole applicant on the six filings cited above, is reachable through mickai.co.uk for collaboration enquiries from regulators, platform operators, and procurement teams.

Originally published at https://mickai.co.uk/articles/multimodal-ai-needs-provenance-or-its-a-deepfake-factory. If you operate in a regulated sector or want sovereign AI on your own hardware, the audit form on mickai.co.uk is the entry point.