Article · 4 July 2026

The Publisher Lawsuits Turned Data Provenance Into A Balance-Sheet Question

When five publishers sued Meta, and Anthropic settled for 1.5 billion dollars, training lineage stopped being an ethics footnote and became a diligence line item

Author

Micky Irons

data provenance, AI copyright, model sourcing, diligence, sovereign AI

On 5 May 2026, five of the largest names in publishing walked into the Southern District of New York and put a number on a problem the AI industry had spent three years waving away. Hachette, Macmillan, McGraw Hill, Elsevier and Cengage, joined by best-selling author Scott Turow, filed a proposed class action against Meta and Mark Zuckerberg. Their claim is that Llama was trained on millions of books and journal articles pulled from pirate repositories like LibGen and Anna's Archive. The complaint does not just allege copying. It alleges that Meta ran scripts to strip copyright management information from the pirated works while leaving that same information intact on public-domain texts, a pattern the plaintiffs frame as deliberate concealment rather than neutral data cleaning.

Eight months earlier, in September 2025, Anthropic agreed to pay 1.5 billion dollars to settle Bartz v. Anthropic, a payout described as the largest publicly reported copyright recovery in United States history. The settlement covers roughly 500,000 pirated works at about 3,000 dollars a book, and Judge William Alsup granted it preliminary approval on 25 September 2025.
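
The per-book figure follows directly from the two headline numbers. A minimal sketch of that arithmetic, using the approximate figures quoted above; the function names are illustrative, and the second helper is only a hypothetical way to scale the same rate to another corpus, not a prediction of any court's award:

```python
# Figures as reported above; both are approximate.
SETTLEMENT_USD = 1_500_000_000
WORKS_COVERED = 500_000  # "roughly 500,000" works


def per_work_payout(total_usd: int, works: int) -> float:
    """Average payout per covered work."""
    if works <= 0:
        raise ValueError("works must be positive")
    return total_usd / works


def exposure(works: int, per_work_usd: float) -> float:
    """Hypothetical exposure if the same per-work rate applied to another corpus."""
    return works * per_work_usd


rate = per_work_payout(SETTLEMENT_USD, WORKS_COVERED)
print(f"{rate:,.0f} dollars per work")  # prints "3,000 dollars per work"
```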

Put those two events next to each other and you see the shape of a new reality. Provenance is no longer an ethics footnote. It is a line on the balance sheet.

The ruling that changed the question

The most important thing to understand about the Anthropic case is not the size of the cheque. It is what Judge Alsup actually held in June 2025, in the Northern District of California. He ruled that training a model on books Anthropic had lawfully bought and digitised was transformative and protected by fair use. Then he drew a hard line. Downloading more than seven million books from shadow libraries and keeping them in a central library was, in his words, irredeemably infringing.

The training was fine. The sourcing was not.

That distinction is the whole game. It means the legal risk in a foundation model does not live in the architecture or the weights in the abstract. It lives in the answer to a single question: where did the data come from, and can you prove it? For years that question had no dollar figure attached. Now it has two: one fixed by a settlement of a billion and a half dollars, and one that will be set by whatever the publishers eventually extract from Meta.

Why this lands on the buyer, not just the builder

If you are a lab training your own model, this is your problem to solve at the source. But most organisations do not train foundation models. They buy them, license them, fine-tune them, and deploy them into regulated workflows. The comfortable assumption has been that liability stops with the vendor.

I do not think that assumption survives contact with these cases. When you deploy a model into your own products and your own decisions, you inherit its lineage whether you documented it or not. A model trained on an undisclosed pile of pirated works is a model carrying a contingent liability that nobody priced. Corporate development teams already run diligence on data lineage, on open-source licence exposure, on privacy posture. Training provenance now belongs on exactly that list. When you acquire a company built on a model, or sign a multi-year deployment, "what was this trained on and who owns it" is a question you ask before you sign, not after you are named in a complaint.

The copyright-management-information allegation sharpens this further. Stripping that information is not a passive act of scraping. It is an alleged attempt to obscure origin, and the plaintiffs have pleaded it as its own count carrying its own damages. For a diligence team, that is the difference between a licensing gap you can remediate and a concealment claim you cannot.

What documented lineage actually buys you

Here is the case I want to make. The models worth deploying into a regulated environment are the ones whose training lineage is documented, owned, and defensible. Not because it is virtuous, though it is, but because it is the only version of the story that survives a diligence review or a discovery request.

This is a principle we built Mickai around from the beginning. Mickai is a Sovereign Intelligence Operating System, a SIOS that a regulated organisation owns and runs inside its own walls, air-gapped if it needs to be, with a cryptographically signed audit record on every action. When we talk about sourcing our own models from licensed, permissively licensed and owned foundations, this is why. A sovereign system whose intelligence rests on an undocumented data pile is not sovereign. It is a liability wearing a sovereignty label.
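
To make "a signed audit record on every action" concrete, here is a minimal sketch of one common pattern: each record is signed, and each signature also covers the previous one, so edits or deletions break the chain. This is an illustration under stated assumptions, not a description of how Mickai implements it. The record fields, the function names and the use of a shared-secret HMAC are all assumptions; a production system would more plausibly use asymmetric signatures with protected keys.

```python
import hashlib
import hmac
import json


def _payload(action: dict, prev_sig: str) -> bytes:
    # Canonical JSON so the same record always serialises to the same bytes.
    return json.dumps({"action": action, "prev": prev_sig}, sort_keys=True).encode()


def sign_record(key: bytes, action: dict, prev_sig: str) -> dict:
    """Return a record whose signature covers the action and the previous signature."""
    sig = hmac.new(key, _payload(action, prev_sig), hashlib.sha256).hexdigest()
    return {"action": action, "prev": prev_sig, "sig": sig}


def verify_chain(key: bytes, records: list) -> bool:
    """Check every signature and that each record links to the one before it."""
    prev = ""
    for rec in records:
        if rec["prev"] != prev:
            return False
        expected = hmac.new(key, _payload(rec["action"], rec["prev"]), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, rec["sig"]):
            return False
        prev = rec["sig"]
    return True


key = b"demo-key-not-for-production"  # hypothetical key, for illustration only
log = []
for action in ({"op": "load_model", "model": "example-model"},
               {"op": "infer", "request_id": 1}):
    prev_sig = log[-1]["sig"] if log else ""
    log.append(sign_record(key, action, prev_sig))

print(verify_chain(key, log))
```

Changing any field of an earlier action after the fact makes `verify_chain` return `False`, which is the property an auditor or a discovery request would rely on.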

Documented lineage does three concrete things for a buyer. It converts an unknown contingent liability into a known, bounded one. It gives your legal team something to stand on if a rights-holder comes calling. And it lets you keep operating a model even if a competitor's model gets pulled into a settlement, because your provenance is not entangled with theirs.

The diligence line item, spelled out

If I were running corporate development or legal diligence today, I would refuse to treat "trained on public data" as an answer. I would ask for the sourcing chain: what corpora, acquired under what licence, from what counterparty, with what documentation. I would ask whether copyright management information was preserved. I would ask whether the vendor can indemnify against a training-data claim, and I would read that indemnity carefully, because an indemnity from a vendor who cannot survive a nine-figure judgment is decoration.
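
Those questions can be captured as a simple record per corpus, with a check that flags anything missing. The sketch below is one hypothetical way to structure it; the class, field names and sample entries are invented for illustration, and the indemnity question is deliberately left out because it needs a legal and financial judgement rather than a field check.

```python
from dataclasses import dataclass


@dataclass
class CorpusRecord:
    name: str
    licence: str         # e.g. "commercial licence"; "" if unknown
    counterparty: str    # who granted the rights; "" if unknown
    documentation: str   # pointer to the contract or licence text; "" if none
    cmi_preserved: bool  # was copyright management information kept intact?


def diligence_gaps(corpora: list) -> list:
    """List every unanswered sourcing question across the supplied corpora."""
    if not corpora:
        return ["no sourcing chain supplied"]
    gaps = []
    for c in corpora:
        for field_name in ("licence", "counterparty", "documentation"):
            value = getattr(c, field_name).strip()
            # "public data" is treated as a non-answer, as argued above.
            if not value or value.lower() == "public data":
                gaps.append(f"{c.name}: {field_name} missing or non-specific")
        if not c.cmi_preserved:
            gaps.append(f"{c.name}: copyright management information not preserved")
    return gaps


# Hypothetical vendor disclosure, for illustration only.
corpora = [
    CorpusRecord("licensed-news", "commercial licence", "Example Publisher Ltd",
                 "contract-2025-014.pdf", True),
    CorpusRecord("web-scrape", "public data", "", "", False),
]
for gap in diligence_gaps(corpora):
    print(gap)
```

In this example the first corpus passes cleanly and the second produces four gaps, which is the kind of itemised list a diligence team can put back to the vendor.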

None of this requires pretending that cloud AI is legally barred. To be honest about the market, almost every regime, from the EU AI Act to the FCA and PRA rulebooks to GDPR, permits cloud AI with the right controls in place. The genuine no-cloud requirement is workload-specific: classified material, ITAR-controlled data, isolated operational and control environments, or a case where a data protection impact assessment comes back negative. The broader pull toward owned, documented, sovereign systems is mostly preference, driven by exactly the kind of risk these lawsuits expose. But preference backed by a billion-dollar precedent tends to harden into policy.

The takeaway

The Meta suit will likely take one to three years to resolve; the Anthropic case ran roughly a year from filing in August 2024 to settlement in September 2025. You do not have to wait for the verdict to act on what it means. The precedent is already set: training can be fair use, but pirated or undocumented sourcing is not, and the cost of getting it wrong is now measured in ten figures.

Provenance moved from the ethics slide to the balance sheet. The models worth owning are the ones that can prove where they came from. Everything else is a liability you have not finished pricing.

Frequently asked questions

Does deploying someone else's model expose me to copyright liability?

It can. Judge Alsup's ruling turned on how training data was sourced, not on who did the training. If you deploy a model built on pirated or undocumented data into your own products, that lineage travels with the model. This is why training provenance now belongs in acquisition and deployment diligence, not just in the vendor's own risk register. It sits alongside the data-provenance questions we treat as first-class in any sovereign deployment.

Is training AI on copyrighted books illegal now?

Not inherently. The June 2025 ruling in Bartz v. Anthropic held that training on lawfully acquired books was transformative fair use. What it held against was maintaining a library of pirated copies, which it called irredeemably infringing. The legality hinges on how the data was obtained, which is precisely why documented lineage matters.

What should a diligence team actually ask about a model?

Ask for the sourcing chain: which corpora, under which licences, from which counterparties, with what documentation. Ask whether copyright management information was preserved. Ask for a training-data indemnity and assess whether the vendor could actually honour it. Treat "trained on public data" as a non-answer.

How does Mickai approach this?

We build Mickai as an owned, air-gapped SIOS with a signed audit record on every action, and we source our own models from licensed and owned foundations so the lineage is documented and defensible. You can read more in our writing on data provenance and on model sourcing for regulated deployment.

Originally published at https://mickai.co.uk/articles/publisher-lawsuits-provenance-liability. If you operate in a regulated sector or want sovereign AI on your own hardware, the audit form on mickai.co.uk is the entry point.