Cloudflare announced new tools Monday that it claims will help end the era of endless AI scraping by giving all sites on its network the power to block bots in one click.
That will help stop the firehose of unrestricted AI scraping, but, perhaps even more intriguing to content creators everywhere, Cloudflare says it will also make it easier to identify which content bots scan most, so that sites can eventually wall off access and charge bots to scrape their most valuable content. To pave the way for that future, Cloudflare is also creating a marketplace for all sites to negotiate content deals based on more granular AI audits of their sites.
These tools, Cloudflare’s blog said, give content creators “for the first time” ways “to quickly and easily understand how AI model providers are using their content, and then take control of whether and how the models are able to access it.”
That’s necessary for content creators because the rise of generative AI has made it harder to value their content, Cloudflare suggested in a longer blog explaining the tools.
Previously, sites could distinguish between approving access to helpful bots that drive traffic, like search engine crawlers, and denying access to bad bots that try to take down sites or scrape sensitive or competitive data.
But now, “Large Language Models (LLMs) and other generative tools created a murkier third category” of bots, Cloudflare said, that don’t perfectly fit in either category. They don’t “necessarily drive traffic” like a good bot, but they also don’t try to steal sensitive data like a bad bot, so many site operators don’t have a clear way to think about the “value exchange” of allowing AI scraping, Cloudflare said.
That’s a problem because enabling all scraping could hurt content creators in the long run, Cloudflare predicted.
“Many sites allowed these AI crawlers to scan their content because these crawlers, for the most part, looked like ‘good’ bots—only for the result to mean less traffic to their site as their content is repackaged in AI-written answers,” Cloudflare said.
All this unrestricted AI scraping “poses a risk to an open Internet,” Cloudflare warned, proposing that its tools could set a new industry standard for how content is scraped online.
How to block bots in one click
Increasingly, creators fighting to control what happens with their content have been pushed to either sue AI companies to block unwanted scraping, as The New York Times has, or put content behind paywalls, decreasing public access to information.
While some big publishers have been striking content deals with AI companies to license content, Cloudflare is hoping new tools will help to level the playing field for everyone. That way, “there can be a transparent exchange between the websites that want greater control over their content, and the AI model providers that require fresh data sources, so that everyone benefits,” Cloudflare said.
Today, Cloudflare site operators can stop manually blocking each AI bot one by one and instead choose to “block all AI bots in one click,” Cloudflare said.
They can do this by visiting the Bots section under the Security tab of the Cloudflare dashboard, then clicking a blue link in the top-right corner “to configure how Cloudflare’s proxy handles bot traffic,” Cloudflare said. On that screen, operators can easily “toggle the button in the ‘Block AI Scrapers and Crawlers’ card to the ‘On’ position,” blocking everything and giving content creators time to strategize what access they want to re-enable, if any.
Beyond just blocking bots, operators can also conduct AI audits, quickly analyzing which sections of their sites are scanned most by which bots. From there, operators can decide which scraping is allowed and use sophisticated controls to decide which bots can scrape which parts of their sites.
“For some teams, the decision will be to allow the bots associated with AI search engines to scan their Internet properties because those tools can still drive traffic to the site,” Cloudflare’s blog explained. “Other organizations might sign deals with a specific model provider, and they want to allow any type of bot from that provider to access their content.”
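To make the audit-then-decide idea concrete, here is a minimal Python sketch of the kind of logic an operator might reason through: tally which bots hit which site sections, then check a request against a per-bot allowlist. It is an illustration only and does not use Cloudflare's dashboard or API; the bot names, request records, path prefixes, and function names are all hypothetical.

```python
from collections import Counter

# Hypothetical request records: (bot name, requested path).
# These are made-up examples, not Cloudflare analytics data.
REQUESTS = [
    ("ExampleSearchBot", "/news/story-1"),
    ("ExampleTrainingBot", "/news/story-1"),
    ("ExampleTrainingBot", "/archive/2019/report"),
    ("ExampleTrainingBot", "/archive/2020/report"),
    ("ExampleSearchBot", "/about"),
]

# Hypothetical policy: bot name -> path prefixes that bot may scan.
# Bots missing from the table are denied everywhere.
ALLOWED_PREFIXES = {
    "ExampleSearchBot": ["/"],        # search crawler: whole site
    "ExampleTrainingBot": ["/news/"],  # training crawler: news section only
}


def section_of(path):
    """Return the top-level section of a path, e.g. '/news/story-1' -> '/news'."""
    first_segment = path.strip("/").split("/")[0]
    return "/" + first_segment


def audit(requests):
    """Count requests per (bot, site section) pair."""
    return Counter((bot, section_of(path)) for bot, path in requests)


def is_allowed(bot, path):
    """Check whether the policy table lets this bot scan this path."""
    return any(path.startswith(prefix) for prefix in ALLOWED_PREFIXES.get(bot, []))


if __name__ == "__main__":
    # Most-scanned (bot, section) pairs first.
    for (bot, section), hits in audit(REQUESTS).most_common():
        print(f"{bot:20} {section:10} {hits}")
    print(is_allowed("ExampleTrainingBot", "/archive/2019/report"))
```

In this sketch the audit step shows the hypothetical training bot concentrating on the archive section, and the policy table then denies it access there while leaving the search crawler unrestricted.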
For publishers already playing whack-a-mole with bots, a key perk would be if Cloudflare’s tools let them write rules restricting bots that scrape sites for both “good” and “bad” purposes, so that sites could keep the good and throw away the bad.
Perhaps the most frustrating bot for publishers today is the Googlebot, which scrapes sites to populate search results as well as to train AI to generate Google search AI overviews that could negatively impact traffic to source sites by summarizing content. Publishers currently have no way of opting out of training models fueling Google’s AI overviews without losing visibility in search results, and Cloudflare’s tools won’t be able to get publishers out of that uncomfortable position, Cloudflare CEO Matthew Prince confirmed to Ars.
For any site operators tempted to toggle off all AI scraping, the prospect of blocking the Googlebot and inadvertently causing dips in traffic may be a compelling reason not to use Cloudflare’s one-click solution.
However, Prince expects “that Google’s practices over the long term won’t be sustainable” and “that Cloudflare will be a part of getting Google and other folks that are like Google” to give creators “much more granular control over” how bots like the Googlebot scrape the web to train AI.
Prince told Ars that while Google solves its “philosophical” internal question of whether the Googlebot’s scraping is for search or for AI, a technical solution to block one bot from certain kinds of scraping will likely soon emerge. And in the meantime, “there can also be a legal solution” that “can rely on contract law” based on improving sites’ terms of service.
Not every site would, of course, be able to afford a lawsuit to challenge AI scraping, but to help creators better defend themselves, Cloudflare drafted “model terms of use that every content creator can add to their sites to legally protect their rights as sites gain more control over AI scraping.” With these terms, sites could perhaps more easily dispute any restricted scraping discovered through Cloudflare’s analytics tools.
“One way or another, Google is going to get forced to be more fine-grained here,” Prince predicted.
https://arstechnica.com/?p=2051732