Article · 4 July 2026

California Now Makes You Disclose Your Training Data. Most Vendors Cannot

AB 2013 took effect on 1 January 2026. The first disclosures from the biggest labs proved a point we have been making for two years. If you scraped the open web, you cannot answer the provenance question. If you trained on a sealed, documented corpus, you can.

Author

Micky Irons

AI governance · training data · AB 2013 · model provenance · AI procurement

On 1 January 2026, a California law quietly changed the question every AI vendor has to answer. Not "how good is your model," but "what did you train it on, and can you prove it." The bill is AB 2013, the Generative Artificial Intelligence: Training Data Transparency Act. It was signed on 28 September 2024 and it is now live. The first wave of disclosures from the largest labs told me everything I needed to know about the difference between a model you can account for and one you cannot.

I have been saying for two years that provenance is not a compliance chore. It is a moat. AB 2013 just turned that argument into law.

What the law actually requires

AB 2013 applies to developers of generative AI systems or services released on or after 1 January 2022 that are made available to people in California. That scope is wide enough to catch essentially every general-purpose model on the market. If you designed, coded, produced, or substantially modified a covered system, you are a developer under the statute.

The obligation is specific. Before you make a covered system publicly available, you have to post on your website a high-level summary of the datasets used to train it, and you have to update that summary whenever you make a substantial modification. The summary has to cover the sources or owners of the datasets, how those datasets serve the system's purpose, and the number of data points, which may be given in general ranges. It has to say whether the datasets included data protected by copyright, trademark, or patent, or whether they are in the public domain, and whether the data was purchased or licensed. It has to disclose whether the datasets contained personal information or aggregate consumer information, whether synthetic data was used, what cleaning or processing was applied, the time period over which the data was collected, and the dates the datasets were first used.
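One way to picture that list is as a structured record with one field per required item. What follows is a minimal sketch, not a statutory form: the statute prescribes topics, not a schema, so the class name `DatasetDisclosure`, every field name, and the `unanswered` helper are hypothetical and chosen here purely for illustration.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DatasetDisclosure:
    """Hypothetical record: one field per item the summary has to cover.

    None means "not yet answered". Field names are illustrative only.
    """
    sources_or_owners: Optional[str] = None
    purpose_served: Optional[str] = None
    data_point_range: Optional[str] = None       # e.g. "1M-10M"
    ip_status: Optional[str] = None              # copyright / trademark / patent / public domain
    purchased_or_licensed: Optional[str] = None
    personal_information: Optional[bool] = None
    aggregate_consumer_information: Optional[bool] = None
    synthetic_data_used: Optional[bool] = None
    cleaning_or_processing: Optional[str] = None
    collection_period: Optional[str] = None
    first_used: Optional[str] = None


def unanswered(d: DatasetDisclosure) -> list[str]:
    """Return the names of fields that are still None or blank."""
    missing = []
    for f in fields(d):
        value = getattr(d, f.name)
        # False is a real answer ("no synthetic data"), so only None/blank counts.
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(f.name)
    return missing


draft = DatasetDisclosure(
    sources_or_owners="Licensed archive from a named publisher (placeholder)",
    data_point_range="1M-10M",
    synthetic_data_used=False,
)
print(unanswered(draft))
```

The same helper works in the other direction: transcribe a vendor's published summary into the record and the list of unanswered fields is a rough measure of how thin that summary is.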

There are narrow exemptions. Systems used solely for security and integrity, systems that operate aircraft in national airspace, and systems developed for national security or defense purposes and made available only to a federal entity fall outside the requirement. For everyone else, the disclosure is the price of doing business in California.

The enforcement point people keep missing

AB 2013 does not create its own civil penalty schedule, and it does not carve out trade secrets. That combination matters. Because there is no trade-secret exemption, the statute gives a developer no confidentiality ground for leaving a required item out. And because the text does not build a bespoke penalty structure, the practical enforcement route runs through California's Unfair Competition Law, which the Attorney General can wield and which opens the door to broader litigation exposure where harm can be shown. So the vendors who read this as "no fine, no problem" have misread it. A thin or misleading disclosure is not a free pass. It is an unfair-competition claim waiting for a plaintiff, sitting next to the copyright suits already in motion.

The tell in the first disclosures

Here is what actually happened when the law went live. The biggest labs posted their summaries, and they were vague on purpose. The reporting on the first disclosures found that developers described only generalised categories: publicly available information, non-public data from third-party partners, data from users, synthetic data. On intellectual property, the summaries hedged. The common formulation acknowledged that training data includes material that "may be protected by copyright" alongside public-domain content, with varying IP status noted but not resolved.

Read that carefully. "May be protected by copyright" is not a description of a corpus. It is an admission that nobody kept a ledger. When your training set is the open web at scale, you genuinely cannot enumerate what is in it, because you never controlled what went in. The high-level summary is thin because the underlying knowledge is thin. That is not a drafting choice. It is the structural consequence of how these models were built.

I do not say this to score points. I say it because in-house counsel and model-risk officers are now going to ask their vendors the same question the law asks, and "may be protected by copyright" is not an answer they can put in front of a regulator or a board.

Why a sealed corpus answers the question and a scrape cannot

Mickai is a Sovereign Intelligence Operating System. Regulated organisations own it and run it inside their own walls, air-gapped where the workload demands it, with a cryptographically signed audit record on every action the system takes. That architecture was not designed for AB 2013. But it is exactly what AB 2013 rewards, because the whole system is built on knowing what went in.
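"Cryptographically signed audit record" can mean several things. The sketch below shows one generic pattern, an HMAC-signed, hash-chained log, using only Python's standard library. It is an assumption made for illustration, not a description of how Mickai implements its audit trail; the function names, the record layout, and the hard-coded demo key are all hypothetical.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous signature" for the first record


def _sign(key: bytes, body: dict) -> str:
    # Canonical JSON (sorted keys) so the same body always yields the same bytes.
    payload = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def append_record(log: list, key: bytes, action: str, detail: str) -> dict:
    """Append a record whose signature also covers the previous signature."""
    prev = log[-1]["sig"] if log else GENESIS
    body = {"seq": len(log), "action": action, "detail": detail, "prev": prev}
    record = dict(body, sig=_sign(key, body))
    log.append(record)
    return record


def verify_log(log: list, key: bytes) -> bool:
    """Recompute every signature and check each link to the previous record."""
    prev = GENESIS
    for seq, record in enumerate(log):
        body = {k: v for k, v in record.items() if k != "sig"}
        if body.get("seq") != seq or body.get("prev") != prev:
            return False
        if not hmac.compare_digest(_sign(key, body), record.get("sig", "")):
            return False
        prev = record["sig"]
    return True


key = b"demo-key-only"  # in practice a managed secret, never a literal
log = []
append_record(log, key, "ingest", "dataset-001 added")
append_record(log, key, "train", "run-001 started")
print(verify_log(log, key))   # True

log[0]["detail"] = "dataset-999 added"  # tamper with history
print(verify_log(log, key))   # False
```

An HMAC proves integrity only to someone who holds the same key. Where a regulator or counterparty needs to verify the log independently, an asymmetric signature scheme would be the more natural choice; the chaining idea stays the same.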

When you train on a sealed, documented corpus, provenance is not something you reconstruct after the fact under legal pressure. It is a property of the build. You know the sources because you chose them. You know the licences because you hold them. You know the collection windows, the processing steps, and whether personal information was ever present, because those facts were recorded at ingestion, not guessed at during a compliance scramble. The disclosure AB 2013 asks for stops being a liability exercise and becomes a description of something you already documented. This is the same principle I have written about under model provenance and content authenticity. The value is not in claiming clean data. It is in being able to show your working.
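To make "recorded at ingestion" concrete, here is a minimal sketch of an ingestion manifest whose entries roll up into the kind of high-level summary the statute asks for. Everything named here (`IngestRecord`, `summarise`, `general_range`, the bucket edges, the sample entries) is hypothetical and illustrative. It is not Mickai's schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class IngestRecord:
    """Hypothetical manifest entry, written once when a dataset is admitted."""
    source: str
    licence: str
    collected_from: date
    collected_to: date
    data_points: int
    contains_personal_info: bool
    processing_steps: tuple


def general_range(n: int) -> str:
    """Bucket an exact count into a coarse range (bucket edges are arbitrary)."""
    for limit, label in [(10**6, "under 1M"), (10**8, "1M to 100M"), (10**10, "100M to 10B")]:
        if n < limit:
            return label
    return "10B or more"


def summarise(manifest: list) -> dict:
    """Derive a high-level summary purely from what was recorded at ingestion."""
    if not manifest:
        raise ValueError("empty manifest: nothing was recorded, so nothing can be summarised")
    return {
        "sources": sorted({r.source for r in manifest}),
        "licences": sorted({r.licence for r in manifest}),
        "data_points": general_range(sum(r.data_points for r in manifest)),
        "collection_window": (
            min(r.collected_from for r in manifest).isoformat(),
            max(r.collected_to for r in manifest).isoformat(),
        ),
        "personal_information": any(r.contains_personal_info for r in manifest),
        "processing": sorted({s for r in manifest for s in r.processing_steps}),
    }


manifest = [
    IngestRecord("Publisher A archive (placeholder)", "commercial licence",
                 date(2023, 1, 1), date(2023, 6, 30), 400_000, False,
                 ("deduplication", "language filter")),
    IngestRecord("Internal manuals (placeholder)", "owned",
                 date(2024, 2, 1), date(2024, 3, 15), 900_000, False,
                 ("deduplication",)),
]
print(summarise(manifest))
```

The point of the sketch is the direction of travel: the summary is computed from records that already exist. With an uncatalogued scrape there is no manifest to pass in, and the function has nothing to work from.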

That is the competitive line the law just drew. A vendor whose model is a black box trained on an uncatalogued scrape can post a summary, but it will be thin, and thin is now a legal exposure. A vendor whose model is trained on a corpus it can account for can post a summary that actually satisfies the statute and survives scrutiny. Same requirement, opposite outcomes, decided entirely by how the model was built before anyone had heard of AB 2013. It is the same argument I keep coming back to on sovereign AI: control over the substrate is not a nice-to-have, it is what makes the hard questions answerable.

Where I am careful to be honest

Two honest caveats, because I will not oversell this. First, AB 2013 does not force anyone off cloud infrastructure, and I am not going to pretend it does. Almost every regime that touches AI, from the EU AI Act to the FCA and PRA regimes to GDPR, permits cloud with the right controls. The genuine no-cloud requirement is workload-specific: classified environments, ITAR-controlled data, isolated operational technology, and cases where a data-protection assessment comes back negative. What is happening in the market is a shift in preference toward sovereignty and provenance, and AB 2013 accelerates it. It does not mandate it.

Second, on our own intellectual property, I state it plainly. Mickai has 104 UK patent applications filed across 13 families, roughly 2,340 claims, naming Mickarle Wagstaff-Irons as inventor. Those are filed and building toward examination and grant. They are not granted, and I will not describe them as granted. The provenance architecture is built and live in the product. The patents around it are in prosecution. Both of those things can be true at once, and I would rather tell you exactly where each one stands than round up.

The takeaway

AB 2013 did not invent the training-data problem. It made the problem legible. For years, "what did you train on" was a question you could deflect with confidence and a good demo. As of 1 January 2026 in California, it is a question you have to answer in writing, on your website, under an enforcement mechanism that runs through unfair-competition law. The vendors who built on an uncatalogued scrape are now answering it with "may be protected by copyright," and everyone reading that answer knows what it means.

If you are in-house counsel or you sit on a model-risk committee, the procurement filter writes itself. Ask your AI vendors for their AB 2013 summary and read how specific it is. Specificity is not a courtesy. It is evidence that the vendor knows what is inside its own model. That is the whole game now, and it is the game Mickai was built to win.

Frequently asked questions

What is California's AB 2013 and when did it take effect?

AB 2013, the Generative Artificial Intelligence: Training Data Transparency Act, was signed on 28 September 2024 and took effect on 1 January 2026. It requires developers of generative AI systems released on or after 1 January 2022 and made available to Californians to post a high-level summary of their training datasets on their website.

What has to be in the disclosure?

The summary must cover the sources or owners of the datasets, how the data serves the system's purpose, the number of data points in general ranges, whether the data is protected by copyright, trademark or patent or is public domain, whether it was purchased or licensed, whether it contained personal or aggregate consumer information, whether synthetic data was used, what cleaning or processing was applied, and the time periods over which the data was collected and first used.

Is there a fine for non-compliance?

The statute does not set out its own civil penalty schedule and does not carve out trade secrets. In practice, enforcement runs through California's Unfair Competition Law, which the Attorney General can use and which creates broader litigation exposure. A thin or misleading disclosure is a legal risk, not a free pass.

How does Mickai make this answerable?

Mickai is a Sovereign Intelligence Operating System that regulated organisations own and run inside their own walls, with a signed audit record on every action. Because it is trained on a sealed, documented corpus rather than an open-web scrape, provenance is recorded at ingestion. That means the AB 2013 disclosure describes something already documented, rather than being reconstructed under legal pressure. See our related writing on model provenance and content authenticity, and on why sovereign AI puts the substrate inside your own walls.

Originally published at https://mickai.co.uk/articles/california-training-data-transparency-act. If you operate in a regulated sector or want sovereign AI on your own hardware, the audit form on mickai.co.uk is the entry point.