ChatClone: when the AI voice on the phone is a deepfake, attestation has to be the answer.
Voice deepfakes hit production scam infrastructure in 2024 and have not stopped getting better. The defence the industry shipped (telecom-side caller-ID upgrades) is the wrong layer. The right layer is per-utterance cryptographic attestation that the voice on the line is the live human it claims to be. ChatClone is the Mickai sub-component that does this. Patent 09. This is how it works and why it has to live on the user's hardware.
By Q4 2025, UK Finance reported that voice-deepfake authorised-push-payment (APP) fraud against SMEs alone had passed eighty million pounds. The carrier industry's answer was a programme of caller-ID upgrades and STIR/SHAKEN-style signed call attestation. That defends the metadata of the call (which number called whom, when, with what carrier-side trust). It does not defend the audio, which is the part the human listener actually trusts. A spoofed voice over an authentic carrier path bypasses every carrier defence by definition.
The right layer to defend is the audio itself. Every utterance the user produces should be signed, in real time, under a key the user holds and no other party can. Every party that needs to verify the voice queries the verification surface in real time and gets a signed yes-or-no on whether the audio chunk in front of them was produced by the keyholder or by something else. ChatClone is the Mickai sub-component that implements this primitive. It is filed under Patent 09 of the Mickai portfolio at the UK Intellectual Property Office (application UK00004373277, sole inventor Micky Irons). This article is the architecture and the threat model.
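As an illustrative sketch of the primitive (all names hypothetical; HMAC-SHA-256 stands in for the hardware-bound ML-DSA-65 signature, which in production would come from a post-quantum library and an asymmetric key pair):

```python
import hashlib
import hmac
import os
import time

# Stand-in for the hardware-bound key: in production the key material
# never leaves the TPM/enclave; here it is a process-local secret.
DEVICE_KEY = os.urandom(32)

def sign_utterance(audio_chunk: bytes, session_nonce: bytes) -> dict:
    """Sign one short audio chunk under the device key, binding it to
    a timestamp and a per-session nonce to defeat replay."""
    ts = str(int(time.time())).encode()
    msg = audio_chunk + ts + session_nonce
    return {"ts": ts, "nonce": session_nonce,
            "sig": hmac.new(DEVICE_KEY, msg, hashlib.sha256).digest()}

def verify_utterance(audio_chunk: bytes, att: dict) -> bool:
    """The signed yes-or-no: was this chunk produced by the keyholder?"""
    msg = audio_chunk + att["ts"] + att["nonce"]
    expected = hmac.new(DEVICE_KEY, msg, hashlib.sha256).digest()
    return hmac.compare_digest(expected, att["sig"])

nonce = os.urandom(16)
chunk = b"\x00\x01pcm-frame"
att = sign_utterance(chunk, nonce)
assert verify_utterance(chunk, att)              # genuine audio verifies
assert not verify_utterance(b"spoofed", att)     # forged audio fails
```

Because HMAC is symmetric, this sketch requires the verifier to hold the same secret; the asymmetric ML-DSA-65 scheme the article describes lets the user publish only a verification key while the signing key stays in hardware.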
What the threat actually is
- Targeted impersonation. The attacker has a clean recording of the target (a podcast, a YouTube interview, a webinar), trains a clone on it, and uses the clone in a live audio session to direct a finance team to wire funds, an IT team to change MFA, or a family member to send money for an emergency that is not happening.
- Pre-recorded coercion. The attacker generates a long-form clip in the target's voice and plays it during a call, cutting to mute whenever they need to listen.
- Real-time conversion. The attacker speaks; a model translates the attacker's voice into the target's voice in near real time. This is the 2026 version of the threat, and the latency floor has dropped low enough to fool a finance director on a normal corporate phone call.
- Hybrid voice + video deepfake. A live video call with a synthesised face and a synthesised voice. Carrier defences do nothing; visual cues are inadequate; only an audio-side cryptographic attestation can break the loop.
Why per-utterance, why hardware-bound
Per-call attestation is too coarse: anyone with thirty seconds of microphone access can pass a single sample test at the start of the call and then hand over to the clone. Per-utterance attestation forces the attacker to forge every short clip independently, which is infeasible because each attestation is keyed: forging even one clip requires the hardware-bound signing key the attacker does not hold.
Hardware-bound is the only way the attestation survives the device-compromise scenario. The signing key has to live in the user's TPM, secure enclave, HSM, or hardware token, with attestation that the key has never left the device. A signing key that lives in software can be exfiltrated by the same malware that drives the impersonation; a signing key that lives in hardware cannot.
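A minimal sketch of the boundary this implies (Python as illustration only; a real implementation calls into a TPM or secure-element API, and Python name mangling is an API-shape illustration, not an actual security boundary):

```python
import hashlib
import hmac
import os

class HardwareBoundKey:
    """API shape of an enclave-held key: the secret is created inside the
    object and no method returns it; only signatures cross the boundary."""

    def __init__(self) -> None:
        self.__secret = os.urandom(32)  # in a real device: generated in the TPM

    def sign(self, message: bytes) -> bytes:
        # The host asks for a signature; it never sees the key material.
        return hmac.new(self.__secret, message, hashlib.sha256).digest()

key = HardwareBoundKey()
sig = key.sign(b"utterance-digest")
# There is deliberately no export method: malware on the host can request
# signatures while it is resident, but it cannot carry the key away and
# impersonate the user after it is removed.
```

This is the structural difference the paragraph above describes: a software key is a file an attacker can copy; a hardware key is an oracle an attacker can only query while present on the device.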
How ChatClone works on the call
- The user speaks. The audio enters the Mickai voice front end on the user's hardware. The same acoustic primitives that power the voice biometric (Patent 02) extract a tamper-evident digest of the audio.
- The digest is submitted to the hardware-bound signing key together with a timestamp and a session nonce. The hardware emits a signature over the chunk under ML-DSA-65 (Patent 08, FIPS 204).
- The signed chunk is transmitted alongside the audio. The audio itself is unchanged; the signature rides in a sidecar channel: an out-of-band SIP header, an SDP attribute, or, for in-app calls, a parallel WebSocket.
- The receiving party's verifier ingests both. It walks the signature chain, verifies it under the user's published verification key, and emits VALID, INVALID, or NOT_PRESENT. Any chunk in the audio that has no signature, or whose signature is invalid, is flagged. The receiving party can refuse to act on unsigned audio.
- Because the audio rides without modification, even if the receiver does not implement ChatClone the call still works exactly as before. Verification is a positive enhancement, not a precondition.
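The steps above can be sketched end to end. Assumptions: HMAC-SHA-256 stands in for ML-DSA-65 (so the verifier here shares the key rather than holding only a published verification key), and each signature covers the previous one, giving the chain the verifier walks:

```python
import hashlib
import hmac
import os
from enum import Enum

class Verdict(Enum):
    VALID = "VALID"
    INVALID = "INVALID"
    NOT_PRESENT = "NOT_PRESENT"

KEY = os.urandom(32)  # stand-in for the user's hardware-bound key

def sign_chain(chunks: list, nonce: bytes) -> list:
    """Sign each chunk over (previous signature + chunk + nonce), so chunks
    cannot be dropped, reordered, or replaced without breaking the chain."""
    sigs, prev = [], b"\x00" * 32
    for chunk in chunks:
        sig = hmac.new(KEY, prev + chunk + nonce, hashlib.sha256).digest()
        sigs.append(sig)
        prev = sig
    return sigs

def verify_chain(chunks: list, sigs: list, nonce: bytes) -> list:
    """Walk the chain and emit one verdict per chunk."""
    verdicts, prev = [], b"\x00" * 32
    for chunk, sig in zip(chunks, sigs):
        if sig is None:
            verdicts.append(Verdict.NOT_PRESENT)  # unsigned audio: flagged
            continue
        expected = hmac.new(KEY, prev + chunk + nonce, hashlib.sha256).digest()
        verdicts.append(Verdict.VALID if hmac.compare_digest(expected, sig)
                        else Verdict.INVALID)
        prev = sig
    return verdicts

nonce = os.urandom(16)
chunks = [b"chunk-0", b"chunk-1", b"chunk-2"]
sigs = sign_chain(chunks, nonce)
assert verify_chain(chunks, sigs, nonce) == [Verdict.VALID] * 3
# An injected chunk is flagged without invalidating the rest of the call:
tampered = [b"chunk-0", b"INJECTED", b"chunk-2"]
assert verify_chain(tampered, sigs, nonce)[1] is Verdict.INVALID
```

The per-chunk verdict is what lets the receiver refuse to act on exactly the unsigned or invalid spans while still hearing the rest of the call.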
What this gives the user, the receiver, and the bank
- The user. Cannot be impersonated to any receiver who checks, because the attacker does not hold the hardware-bound key: any attempt at impersonation produces audio with no valid attestation, and a verifying receiver knows immediately that the audio is not from the keyholder.
- The receiver (a colleague, family member, customer, supplier). Verifies in real time that the voice on the line is the live human it claims to be. Refuses to act on instructions in unsigned or invalid-signed audio. The verification UI fits in a small status indicator next to the caller's name.
- The bank or financial institution. Receives a signed attestation alongside any push-payment authorisation. The signature ties the authorisation to the live human, in the live moment, on the user's hardware. APP fraud against the user becomes structurally harder; the bank's fraud surface is reduced; the user retains full control because no third party holds the key.
- The auditor. The signature chain is appended to the post-quantum signed audit ledger (Patent 16, ML-DSA-65). A regulator investigating a disputed transaction has cryptographic certainty about whether the user actually said what was attributed to them.
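A hash-chained sketch of the ledger append (SHA-256 chaining only; the ML-DSA-65 signatures over entries that Patent 16 describes are omitted, and all field names are hypothetical):

```python
import hashlib
import json
import time

def append_entry(ledger: list, attestation_sig: str, verdict: str) -> dict:
    """Append one verification event; every entry commits to its
    predecessor's hash, so rewriting history breaks all later entries."""
    prev = ledger[-1]["entry_hash"] if ledger else "0" * 64
    body = {"ts": time.time(), "sig": attestation_sig,
            "verdict": verdict, "prev_hash": prev}
    body["entry_hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    ledger.append(body)
    return body

def ledger_intact(ledger: list) -> bool:
    """Re-walk the chain; any edited entry invalidates the link."""
    prev = "0" * 64
    for entry in ledger:
        if entry["prev_hash"] != prev:
            return False
        check = {k: v for k, v in entry.items() if k != "entry_hash"}
        if hashlib.sha256(json.dumps(check, sort_keys=True).encode()
                          ).hexdigest() != entry["entry_hash"]:
            return False
        prev = entry["entry_hash"]
    return True

ledger = []
append_entry(ledger, "a1b2", "VALID")
append_entry(ledger, "c3d4", "INVALID")
assert ledger_intact(ledger)
ledger[0]["verdict"] = "rewritten"   # tamper with history
assert not ledger_intact(ledger)
```

This hash chain gives tamper evidence; the post-quantum signatures the article describes additionally bind each entry to the keyholder, which a bare hash chain cannot.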
What this does NOT do
ChatClone does not analyse the audio for deepfake artefacts. The whole industry of deepfake detection is a game of whack-a-mole: every detector has a half-life, every model improves to defeat the current generation, and the user is always one round behind. Cryptographic attestation flips the game: instead of trying to detect the fake, the user proves the genuine. Anything not provably genuine is treated as suspicious, regardless of whether the current generation of detectors would catch it.
ChatClone also does not require any change to the global telephony stack. The audio is unchanged. The signature is sidecar. Any verifier that wants to check it can; verifiers that do not implement it lose nothing. This is an opt-in defence the user controls, not a centralised carrier-side mandate. Sovereign means user-controlled.
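The sidecar framing can be sketched as a JSON message keyed by chunk sequence number, with the media bytes never touched (field names hypothetical; whether the sidecar rides a SIP header, an SDP attribute, or a WebSocket is deployment-specific):

```python
import base64
import json

def sidecar_frame(seq: int, session_id: str, signature: bytes) -> str:
    """Build the out-of-band message that rides alongside the audio."""
    return json.dumps({"v": 1, "session": session_id, "seq": seq,
                       "sig": base64.b64encode(signature).decode()})

audio_frame = b"\x01\x02raw-pcm-bytes"          # sent on the normal media path
frame = sidecar_frame(0, "call-123", b"\xaa" * 16)  # sent on a parallel channel

# A receiver that implements ChatClone parses the sidecar and verifies;
# a receiver that does not simply never opens the parallel channel, and
# the call behaves exactly as before.
parsed = json.loads(frame)
assert base64.b64decode(parsed["sig"]) == b"\xaa" * 16
assert audio_frame == b"\x01\x02raw-pcm-bytes"  # media bytes unmodified
```

Keeping the signature out of the media path is what makes the scheme deployable one endpoint at a time, with no carrier cooperation required.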
Where this sits in Mickai
ChatClone runs as one of the cooperating sub-components inside Mickai, alongside the voice biometric (Patent 02), the hardware-bound actor identity (Patent 12), the post-quantum signing primitive (Patent 08), the multi-brain arbiter (Patent 06), the post-quantum signed audit ledger (Patent 16, Patent 08), and the runtime perimeter on every agent (Patent 21, Sentinel). Mickai is held privately by its founder; the engagement model is direct.
“Sovereign means the voice on the line is provably the user. The key is in the user's hardware. The attestation is per utterance. The receiver decides what to trust.”
Sources
- UK Finance Annual Fraud Report 2025: voice-deepfake APP fraud against UK SMEs.
- FIPS 204 (ML-DSA): NIST post-quantum digital signature standard.
- Mickai patent portfolio: mickai.co.uk/patents (Patent 09, ChatClone anti-deepfake voice attestation).
- Previous Mickai articles: mickai.co.uk/articles/voice-biometric-extreme-environment-verification, mickai.co.uk/articles/the-2026-sovereign-ai-manifesto.