AI models have a voracious appetite for data. Keeping the information they present to users up to date is a challenge. And so companies at the vanguard of AI appear to have hit on an answer: crawling the web—constantly.
But website owners increasingly don’t want to give AI firms free rein. So they’re regaining control by cracking down on crawlers.
To do this, they’re using robots.txt, a file held on many websites that acts as a guide to how web crawlers are allowed—or not—to scrape their content. Originally designed as a signal to search engines as to whether a website wanted its pages to be indexed or not, it has gained increased importance in the AI era as some companies allegedly flout instructions.
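A robots.txt file that opts out of AI training crawlers while still admitting ordinary search engine crawlers might look like the following sketch. The user-agent names are real crawler identifiers mentioned in the study; the file itself is an illustrative example, not any particular site's policy:

```
# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# All other crawlers (e.g., regular search indexing) remain allowed
User-agent: *
Allow: /
```

Note that robots.txt is advisory: it signals a site's wishes but does not technically enforce them, which is why compliance by AI companies has become a point of contention.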
In a new study, Nicolas Steinacker-Olsztyn, a researcher at Saarland University, and his colleagues analyzed how different websites treated robots.txt—and whether sites rated as reputable differed from those rated as not reputable in whether they allowed crawling. For many AI companies, “It’s kind of a ‘do now and ask for forgiveness later’ thing,” Steinacker-Olsztyn says.
In the study, more than 4,000 sites were checked for their responses to 63 different AI-related user agents, including GPTBot, ClaudeBot, CCBot, and Google-Extended—all of which are used by AI companies in their effort to soak up information.
The websites were then divided between reputable news outlets or misinformation sites, using ratings devised by Media Bias/Fact Check, an organization that categorizes news sources depending on their credibility and the factuality of their reporting.
Across all 4,000 sites assessed, around 60% of those deemed to be reputable news websites blocked at least one AI crawler from accessing their information; among misinformation sites, only 9.1% did so.
The average reputable site blocks more than 15 different AI agents through its robots.txt file. Misinformation sites, by contrast, tend not to shut out the crawlers at all.
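The kind of per-agent check the study performed can be sketched with Python's standard-library robots.txt parser. The file content and URL below are hypothetical, and the agent list is a small subset of the 63 user agents the study tested:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks two AI crawlers and allows everything else
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

# A few of the AI-related user agents named in the study
AI_AGENTS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Collect the agents this robots.txt disallows for a sample page
blocked = [agent for agent in AI_AGENTS
           if not parser.can_fetch(agent, "https://example.com/article")]
print(blocked)  # → ['GPTBot', 'ClaudeBot']
```

Running a check like this against each site's live robots.txt, for every agent on the list, yields the per-site blocking counts the study reports.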
“The biggest takeaway is that the reputable news websites keep well up-to-date with the evolving ecosystem as it pertains to these major AI developers and their practices,” Steinacker-Olsztyn says.
Over time, the gap between sites willing to let bots crawl them and those that aren’t is widening. From September 2023 to May 2025, the proportion of reputable news sites locking out at least one crawler increased from 23% to 60%, while the share among misinformation sites stayed flat, the study found.
The result, Steinacker-Olsztyn says, is that less reputable content is being hoovered up by, and then spat out of, AI models used routinely by hundreds of millions of people. “Increasingly these models are also being used simply for information retrieval, replacing traditionally used options such as search engines or Google,” Steinacker-Olsztyn adds.
The conundrum over legitimate data
For AI models to stay up-to-date on current events, they are trained on reputable sites, which is exactly what these sites don’t want.
The war over copyright and access to training data between AI companies and news sites is increasingly spilling into courts—The New York Times’s lawsuit against OpenAI, the maker of ChatGPT, for example, carried on into last week.
Those lawsuits are prompted by allegations that AI companies are illegally scraping data on news websites to act as regularly updated, ground-truth-based training data for the models powering their AI chatbots. In addition to litigating their disputes, reputable news websites are blocking AI crawlers.
That’s good for their businesses and rights. But Steinacker-Olsztyn is concerned about the broader impact. “If reputable news is increasingly making this information unavailable, then this gives reason to believe this can affect the reliability of these models,” he explains. “Going forward, this is changing the percentage of legitimate data that they have access to.”
In essence: It doesn’t matter to an AI crawler whether it’s viewing The New York Times or a disinformation website run out of Hoboken. They’re both training data, and if one is easier to access than the other, that’s all that matters.
Not everyone is quite so certain about the negative impact of blocking crawlers.
Felix Simon, a research fellow in AI and digital news at the University of Oxford-based Reuters Institute for the Study of Journalism, says he wasn’t surprised to learn that sites trafficking in misinformation would want to be crawled, “whereas traditional publishers have an incentive at this point to prevent such scraping.” Some of these traditional publishers, he adds, still allow some scraping “for a plethora of reasons.”
Simon also cautions that just because misinformation sites are more likely to open their doors to AI crawlers, it doesn’t necessarily mean that they’re polluting the information space as much as we may fear.
“AI developers filter and weigh data at various points of the system training process and at inference time,” he says. “One would hope that by the same means by which the authors have been able to identify untrustworthy websites, AI developers would be able to filter out such data.”