Article · 4 July 2026

The GDPR Is About To Let You Train on Legitimate Interest. That Raises the Provenance Bar, Not Lowers It

An easier legal basis for training on personal data moves the burden downstream, to proving exactly what went into the model and why

Author

Micky Irons

Published

4 July 2026

Tags: GDPR, AI training, legitimate interest, data provenance, Digital Omnibus

The European Commission has proposed something the AI industry has wanted for years. Under the Digital Omnibus package published on 19 November 2025, a new Article 88c would confirm that developing and operating an AI model can rest on legitimate interest, not only consent. A companion provision would permit the necessary processing of special-category data for bias detection and correction, even outside high-risk systems, under strict safeguards. Read one way, that is a gift. Read carefully, it is an invoice. The moment your legal basis stops being "the data subject said yes" and becomes "we judged our interest legitimate," the thing that decides whether you win or lose an enforcement fight is no longer a consent checkbox. It is your ability to prove, dataset by dataset, exactly what went into the model and why.

We build the system that produces that proof. So let me be precise about what is changing and what it demands.

What the proposal actually says

Three things matter, and the safeguards are the whole point.

First, Article 88c would establish that AI development and operation may generally be pursued as a legitimate interest where appropriate. It does not say training is a free-for-all. Legitimate interest is a balancing test, and the proposal keeps the balance loaded with obligations: data minimisation, a documented risk assessment, and an unconditional right for data subjects to object to the processing. Unconditional means you cannot argue them out of it. If they object, you stop processing their data.

Second, the special-category provision takes an ability that already existed for high-risk AI systems (processing sensitive data to detect and mitigate bias) and extends it to systems that are not high-risk, subject to safeguards. This is narrow and it is conditional. It is permission to touch sensitive attributes for the specific purpose of measuring and correcting discrimination, not a general licence to ingest health, ethnicity, or belief data because it is useful.

Third, none of this is law yet. The Commission has proposed it, the co-legislators can still amend it, and finalisation is expected around mid to late 2026. Building your data strategy on the current draft as though it were settled is its own risk.

An easier basis raises the evidence bar, it does not remove it

Consent was administratively painful but evidentially simple. You either had a valid, recorded consent for a record or you did not. Legitimate interest inverts that. The basis is easier to claim and far harder to defend, because defending it means reconstructing a judgement after the fact.

When a regulator or a claimant asks why a given person's data sat in your training corpus, "legitimate interest" is the start of the conversation, not the end. You will be asked to show the legitimate-interest assessment you ran before processing, the data-minimisation decisions you made, how you honoured objections, and, for any sensitive data, that it entered solely for bias detection under the safeguards. Every one of those answers is a provenance question. Where did this record come from? On what basis was it admitted? What was done to it? And can you show a log that nobody edited after the letter arrived?
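Those questions can be turned into a gate that runs before a dataset is admitted. What follows is a minimal Python sketch under stated assumptions: the field names, the purpose labels, and the individual checks are hypothetical illustrations, not wording from the proposed Article 88c and not a description of any particular product.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class AdmissionRequest:
    """Hypothetical metadata gathered before a dataset enters training."""
    dataset_id: str
    source: str                        # where the records came from
    legal_basis: str                   # e.g. "legitimate_interest" or "consent"
    lia_reference: Optional[str]       # ID of the legitimate-interest assessment
    lia_completed_on: Optional[date]   # when that assessment was completed
    minimisation_steps: tuple          # e.g. ("drop_free_text", "hash_user_id")
    objection_channel: Optional[str]   # how a data subject can object
    contains_special_category: bool
    purpose: str                       # "general_training" or "bias_detection"


def admission_problems(req: AdmissionRequest, today: date) -> list:
    """Return the reasons a request cannot be admitted; empty means admit."""
    problems = []
    if req.legal_basis == "legitimate_interest":
        if not req.lia_reference or req.lia_completed_on is None:
            problems.append("no documented legitimate-interest assessment")
        elif req.lia_completed_on > today:
            problems.append("assessment is dated after admission")
        if not req.objection_channel:
            problems.append("no route for data subjects to object")
    if not req.minimisation_steps:
        problems.append("no data-minimisation decision recorded")
    if req.contains_special_category and req.purpose != "bias_detection":
        problems.append("special-category data outside bias detection")
    return problems


complete = AdmissionRequest(
    dataset_id="ds-001", source="support-tickets-export",
    legal_basis="legitimate_interest", lia_reference="LIA-2026-014",
    lia_completed_on=date(2026, 6, 1),
    minimisation_steps=("drop_free_text", "hash_user_id"),
    objection_channel="privacy@example.org",
    contains_special_category=False, purpose="general_training",
)
incomplete = AdmissionRequest(
    dataset_id="ds-002", source="scraped-forum",
    legal_basis="legitimate_interest", lia_reference=None,
    lia_completed_on=None, minimisation_steps=(), objection_channel=None,
    contains_special_category=True, purpose="general_training",
)

print(admission_problems(complete, date(2026, 7, 4)))    # empty list: admit
print(admission_problems(incomplete, date(2026, 7, 4)))  # four reasons to refuse
```

The point of the gate is ordering: the assessment and the minimisation decision exist before the data is admitted, so the answer to "why was this here" is read from a record rather than reconstructed.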

This is the counter-intuitive part that most teams will get wrong. They will treat 88c as friction removed and scale their pipelines faster. What it actually does is move the burden from getting permission upstream to demonstrating diligence downstream, permanently, for the life of every model you ship.

Provenance is not a document, it is a pipeline property

You cannot bolt this on with a spreadsheet written the week the regulator calls. A defensible answer requires that provenance be a property of the pipeline itself: captured at ingestion, immutable once written, and attributable to a specific human decision.

That is what a Sovereign Intelligence Operating System is built to do. Mickai runs inside your own walls, air-gapped where the workload demands it, and writes a cryptographically signed audit record on every action the system takes. When a dataset is admitted for training, the legal basis, the minimisation applied, and the operator who approved it are signed into the record at that moment. When someone exercises the unconditional right to object, the removal is an event with a signature and a timestamp, not a promise in a policy. When sensitive data is used for bias detection, the record shows the purpose, the safeguard, and the boundary, so "we only touched it for fairness testing" is a demonstrable fact rather than an assertion.
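To make "signed at that moment" concrete, here is a minimal Python sketch of an append-only log in which each entry is chained to the one before it and signed. It is an illustration, not the Mickai implementation. The class name, event fields, and key are hypothetical, and it uses an HMAC with a shared secret from the standard library as a stand-in; a production system would more plausibly use an asymmetric signature with protected key storage.

```python
import hashlib
import hmac
import json


class AuditLog:
    """Append-only log; each entry is chained to the previous one and signed."""

    def __init__(self, signing_key: bytes):
        self._key = signing_key
        self.entries = []

    def _sign(self, payload: bytes) -> str:
        return hmac.new(self._key, payload, hashlib.sha256).hexdigest()

    def append(self, event: dict, timestamp: str) -> dict:
        # Chain to the previous entry's signature so entries cannot be
        # silently reordered or dropped from the middle of the log.
        prev = self.entries[-1]["signature"] if self.entries else "genesis"
        body = {"event": event, "timestamp": timestamp, "prev": prev}
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        entry = dict(body, signature=self._sign(payload))
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every signature and check the chain links."""
        prev = "genesis"
        for entry in self.entries:
            body = {k: entry[k] for k in ("event", "timestamp", "prev")}
            payload = json.dumps(body, sort_keys=True).encode("utf-8")
            if entry["prev"] != prev:
                return False
            if not hmac.compare_digest(entry["signature"], self._sign(payload)):
                return False
            prev = entry["signature"]
        return True


log = AuditLog(b"demo-key-not-for-production")  # hypothetical key
log.append(
    {"action": "admit_dataset", "dataset": "ds-001",
     "legal_basis": "legitimate_interest",
     "minimisation": ["drop_free_text"], "approved_by": "operator-17"},
    "2026-07-04T09:00:00Z",
)
log.append(
    {"action": "objection_removal", "subject": "subj-42"},
    "2026-07-05T10:30:00Z",
)
intact = log.verify()

# Rewriting history after the fact breaks the signature on that entry.
log.entries[0]["event"]["legal_basis"] = "consent"
tampered_ok = log.verify()
print(intact, tampered_ok)
```

Editing the first entry after the fact causes verification to fail, which is the property that separates a log from a spreadsheet.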

The reason to own this rather than rent it is straightforward, and it is a preference argument, not a prohibition one. For most regimes there is no legal bar on cloud at all. DORA, the FCA and PRA rulebooks, the EBA guidelines, the NHS DSP Toolkit, and GDPR itself all permit cloud with the right controls. The genuine no-cloud position is workload-level: classified or SECRET-and-above material, ITAR-controlled data, isolated OT and SCADA environments, and processing where a DPIA comes out against cloud hosting. Everywhere else the case for sovereignty is about control, cost, and cutting your data-exfiltration surface. That preference is why a register-backed market of roughly 16,000 regulated and large-private institutions across the UK and EU is moving toward owned intelligence, and why the enterprise-AI-platform software category is forecast to grow from about £11.7bn in 2024 toward £39.7bn by 2030. If your training provenance lives in a third party's control plane, your evidence lives under their retention policies, their outages, and their subpoenas. We think the organisation whose name is on the model should hold the record.

Precision on the safeguards, because they are the exposure

Two cautions for teams moving on this. The unconditional right to object is not a soft preference you can rate-limit. If your architecture cannot cleanly identify and remove one person's contribution from a training set, you do not have a compliant pipeline, you have a liability waiting for its first objection. Design for removal before you design for scale.
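Designing for removal mostly means keeping an index from person to records from the first day of ingestion. A minimal Python sketch, assuming hypothetical record and subject identifiers: it removes a person's records from the corpus, blocks their re-ingestion, and records the removal as an event. It does not address what to do about a model already trained on those records, which is a separate problem.

```python
from collections import defaultdict


class TrainingCorpus:
    """Corpus that can remove everything contributed by one data subject."""

    def __init__(self):
        self._records = {}                    # record_id -> text
        self._by_subject = defaultdict(set)   # subject_id -> record_ids
        self._objected = set()                # subjects who have objected
        self.removal_events = []

    def add(self, record_id: str, subject_id: str, text: str) -> bool:
        if subject_id in self._objected:
            return False  # an objection also blocks later re-ingestion
        self._records[record_id] = text
        self._by_subject[subject_id].add(record_id)
        return True

    def object(self, subject_id: str, timestamp: str) -> dict:
        """Remove every record for the subject and record what was removed."""
        removed = sorted(self._by_subject.pop(subject_id, set()))
        for record_id in removed:
            del self._records[record_id]
        self._objected.add(subject_id)
        event = {"subject": subject_id, "removed": removed,
                 "timestamp": timestamp}
        self.removal_events.append(event)
        return event

    def record_ids(self) -> list:
        return sorted(self._records)


corpus = TrainingCorpus()
corpus.add("r1", "subj-a", "ticket text one")
corpus.add("r2", "subj-b", "ticket text two")
corpus.add("r3", "subj-a", "ticket text three")

event = corpus.object("subj-a", "2026-07-05T10:30:00Z")
remaining = corpus.record_ids()
readded = corpus.add("r4", "subj-a", "late arrival")
print(event["removed"], remaining, readded)
```

If the subject index is missing at ingestion, the objection handler has nothing to look up, which is why removal has to be designed before scale.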

And the bias-detection permission is a scalpel, not a door. Processing special-category data under it is lawful only for detecting and correcting bias, under the safeguards, and the burden of showing you stayed inside that purpose is yours. An audit trail that proves the sensitive data never leaked into general training is the difference between a defensible fairness programme and a special-category breach.
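One way to keep the scalpel from becoming a door is to tag every record with the purposes it was admitted for and to route every selection through a single function that logs what it handed out. The Python sketch below assumes hypothetical purpose labels and record fields; it illustrates the segregation idea and is not a statement of what the safeguards legally require.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    record_id: str
    special_category: bool
    permitted_purposes: frozenset   # purposes fixed at admission time


class PurposeViolation(Exception):
    """Raised when special-category data is tagged for a non-bias purpose."""


def select_for(purpose: str, records: list, access_log: list) -> list:
    """Return IDs usable for `purpose` and log exactly what was released."""
    selected = []
    for rec in records:
        if purpose not in rec.permitted_purposes:
            continue
        if rec.special_category and purpose != "bias_detection":
            # A mis-tagged record fails loudly instead of leaking quietly.
            raise PurposeViolation(
                f"{rec.record_id} is special-category data tagged for {purpose}"
            )
        selected.append(rec.record_id)
    access_log.append({"purpose": purpose, "records": list(selected)})
    return selected


records = [
    Record("r1", False, frozenset({"general_training"})),
    Record("r2", True, frozenset({"bias_detection"})),
    Record("r3", False, frozenset({"general_training", "bias_detection"})),
]
access_log = []
training_ids = select_for("general_training", records, access_log)
fairness_ids = select_for("bias_detection", records, access_log)

# The audit question: did any sensitive record leave under another purpose?
sensitive_ids = {r.record_id for r in records if r.special_category}
leaks = [e for e in access_log
         if e["purpose"] != "bias_detection"
         and sensitive_ids & set(e["records"])]
print(training_ids, fairness_ids, leaks)
```

The access log is what turns "it never leaked" from an assertion into something a reviewer can check, provided the log itself is tamper-evident.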

The takeaway

Legitimate interest for AI training is a genuine easing of the entry requirement and a genuine tightening of the proof requirement. The teams that will thank the Digital Omnibus are the ones who already run owned, logged, signed training pipelines, because for them the new basis is simply a lower door they can walk through with their evidence already in hand. The teams that will regret it are the ones who read "legitimate interest" as "we can stop keeping receipts." You now need better receipts than ever. We built the machine that keeps them.

Frequently asked questions

Does the Digital Omnibus mean we can train on any personal data without consent?

No. The proposed Article 88c would let AI development rest on legitimate interest where appropriate, but that basis carries a balancing test plus mandatory safeguards: data minimisation, a documented risk assessment, and an unconditional right for individuals to object. It lowers the barrier to entry while raising the standard of proof you must maintain.

Can we use sensitive data like ethnicity or health for AI now?

Only for the narrow purpose the proposal permits: detecting and correcting bias, under strict safeguards, now extended beyond high-risk systems. It is not a general licence to ingest special-category data because it is useful. You carry the burden of proving the data entered solely for that purpose and never leaked into general training, which is a provenance and audit question.

Is this already law?

Not yet. The Commission proposed the package on 19 November 2025 and finalisation is expected around mid to late 2026, with the text still open to change. Treat it as direction, not settled ground, and build pipelines that would satisfy the safeguards as drafted rather than betting on a specific final wording.

Why does an easier legal basis make provenance more important, not less?

Because consent was simple to evidence and legitimate interest is not. Defending legitimate interest means reconstructing your assessment, your minimisation, your objection handling, and your safeguards after the fact. That is only credible if your pipeline captured it at the time, immutably. See our related work on the Mickai signed audit record and on why owned, sovereign training pipelines beat rented ones for regulated data.

Sources: Latham & Watkins, Sidley Data Matters, IAPP

Originally published at https://mickai.co.uk/articles/legitimate-interest-for-ai-training-the-gdpr-change-and-the-provenance-burden-it-shifts. If you operate in a regulated sector or want sovereign AI on your own hardware, the audit form on mickai.co.uk is the entry point.