Cool Site Shows Exactly Which Books Zuckerberg's Minions Illegally Downloaded to Train Meta's AI
For all the revolutionary change artificial intelligence promises, it also makes lofty demands. For starters, AI is extraordinarily power-hungry. Running the datacenters behind it takes enormous amounts of electricity, not to mention hardware and cooling infrastructure. All of that costs a lot, making AI a huge money pit. That's had a big effect on the economy, as the tiniest bit of AI hype can send huge shockwaves through Wall Street and beyond.
But AI's also greedy in less noticeable ways: namely, for your data.
The large language models (LLMs) that underpin products like OpenAI's ChatGPT, for instance, need to devour enormous datasets of written words to fine-tune an algorithm to follow the rules of language. They're so hungry for raw data, in fact, that original material for these algorithms to gobble up is becoming hard to come by.
"We’re literally running out of text in the universe to train these systems on," said computer science scholar Stuart Russell back in 2023. Now, in 2025, the well has all but dried up.
Meta, the parent company of Facebook and Instagram, has inadvertently pulled the curtain back on what it looks like to ingest all that data.
In January, Meta lost a huge fight with a group of authors who sued the company for using their books to train its AI. The case uncovered the fact that Meta had illegally downloaded an infamous pirate library, LibGen, to procure millions of legally protected texts. Those books were then fed to Meta's LLM, Llama, after software engineers got approval from the Zuck himself. In other words, one of the largest companies in the world didn't even bother to pay for a single copy of each book it used to build its AI.
This week, The Atlantic compiled a search engine that could trawl the LibGen files and uncover which books, exactly, were scraped by Meta. The scope of Meta's data harvesting operation is extensive, spanning over 7.5 million books and some 81 million academic papers, on top of work published by museums, architects, and artists.
The suit was led by authors like Ta-Nehisi Coates and Sarah Silverman, who had an idea of Meta's data piracy thanks to a previous 2023 lawsuit. But the new search tool is now enabling writers and scholars to see what work, exactly, was pirated to train Meta's for-profit LLM — resulting in plenty of discourse around copyright laws, AI ethics, and media piracy.
"My book is in here — and, good! LibGen makes texts available to people who might not otherwise have access," said Wired writer Justin Ling. "The problem, imo, isn't LibGen making content available for free: It's Meta stealing that material for profit."
Whether Meta will need to make these writers whole remains to be seen, as a decision isn't expected until summer. Regardless, the damage is already done: Llama is running wild and free on platforms like Facebook, Instagram, and WhatsApp. It's a telling moment for the future of data in a world dominated by big tech.
The post Cool Site Shows Exactly Which Books Zuckerberg's Minions Illegally Downloaded to Train Meta's AI appeared first on Futurism.