Google's AI Is Scraping Even Sites That Ask to Be Ignored

May 5, 2025 - 22:10
A Google DeepMind executive admitted that the company still trains its AI models on data collected from sites that opted out of AI training.

Don't want a tech conglomerate to train its AI model on your website? Too bad — Google will do it anyway, thanks to a very convenient workaround.

At least, that's more or less what the Silicon Valley behemoth just admitted to in court.

As Bloomberg reports, Google said that while it does give publishers the option to opt out of large language model training done by its AI lab, DeepMind, that option doesn't extend to AI efforts by other parts of the company, including the unit in charge of its dominant search engine, which has its own AI products like the much-maligned AI Overviews.

The admission was made by Eli Collins, a vice president at DeepMind, when he was called as a witness during a federal antitrust trial in Washington. Diana Aguilar, a Department of Justice lawyer, grilled Collins about the glaring loophole being used to develop the company's chatbot, Gemini.

"Once you take the Gemini and put it inside the search org, the search org has the ability to train on the data that publishers had opted out of training, correct?" Aguilar asked, per Bloomberg.

"Correct — for use in search," Collins confirmed.

The scale of this scraping is staggering. An internal document from 2024 cited by Aguilar showed that Google had collected a total of 160 billion tokens (short units of text) in AI training data. Half of the tokens were stated to have been removed since they came from publishers who opted out of AI training. But based on Collins' new testimony, those 80 billion tokens are still being used to train AI at Google, just not at DeepMind itself.

In another example of Google slipperiness, there actually is one way to opt out of having your website trawled by an AI: by opting out of being indexed in Google's search engine entirely. That's a death sentence for any website, a choice that's really no choice at all.

Google implies this is merely a consequence of how the widely used "robots.txt" file works. The file instructs web crawlers (the bots that collect data for search engines and now AI training efforts) on which parts of a website they can access.

"Google has a separate way for publishers to manage their content in Search via the well-established robots.txt web standard," a Google spokesperson said in a statement, per Bloomberg.
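To make the robots.txt mechanism concrete, here is a minimal sketch of how a publisher's per-crawler rules get evaluated, using Python's standard `urllib.robotparser`. Everything specific in it is an assumption for illustration: the site `example.com` and the file contents are hypothetical, and the `Google-Extended` and `Googlebot` tokens come from Google's public crawler documentation, not from the testimony or the Bloomberg report.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary publisher (not from the article).
# It blocks the AI-training token but leaves the search crawler unrestricted.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines; no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://example.com/some-article"  # hypothetical page
training_allowed = is_allowed("Google-Extended", url)
search_allowed = is_allowed("Googlebot", url)

print("AI-training token allowed:", training_allowed)
print("Search crawler allowed:", search_allowed)
```

Under these assumptions, the page is off limits to the AI-training token yet still open to the search crawler. That is the gap the testimony describes: a publisher who wants to stay in search results leaves that second door open, and content collected through it can, per Collins, be used to train AI for search.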

Last year, a federal judge ruled that Google holds an illegal monopoly over the search engine market, abusing its dominance to shut out competitors (for example, by paying companies billions of dollars to set Google as the default search engine on their devices and services) and to unfairly raise ad prices.

US regulators are still deciding how to break up the monopoly. Some of the options being considered include forcing Google to sell its popular Chrome browser — with its AI competitor OpenAI circling like a vulture — or banning the default search engine agreements made with other companies, or forcing Google to share some of its data.

Now, the federal suit is also highlighting how Google leverages its search engine dominance (it consistently maintains a roughly 90 percent market share in the US) to get what it wants with its AI initiatives. If Google is telling websites that the only way to avoid its AI data scraping is to not show up in a Google search, cutting them off from that 90 percent of web traffic, that might be evidence of a monopoly. The education website Chegg argued as much in a recent lawsuit, claiming that Google was using its monopoly to pressure Chegg into letting Google train its AI tools on Chegg's content for free.

The post Google's AI Is Scraping Even Sites That Ask to Be Ignored appeared first on Futurism.