Both OpenAI and Google have released guidance for website owners who do not want the two companies using the content of their sites to train the companies’ large language models (LLMs). We’ve long been supporters of the right to scrape websites (the process of using a computer to load and read pages of a website for later analysis) as a tool for researchers, journalists, and archivists. We believe this practice is still lawful when collecting training data for generative AI, but the question of whether something should be illegal is different from whether it may be considered rude, gauche, or unpleasant. As norms continue to develop around what kinds of scraping and what uses of scraped data are considered acceptable, it is useful to have a tool for website operators to automatically signal their preference to crawlers. Asking OpenAI and Google (and anyone else who chooses to honor the preference) to not include scrapes of your site in their models is an easy process as long as you can access your site’s file structure.
We’ve talked before about how these models use art for training, and the general idea and process are the same for text. Researchers have long used collections of data scraped from the internet for studies of censorship, malware, sociology, language, and other applications, including generative AI. Today, both academic and for-profit researchers collect training data for AI using bots that go out searching all over the web and “scrape up” or store the content of each site they come across. This might be used to create purely text-based tools, or a system might collect images that may be associated with certain text and try to glean connections between the words and the images during training. The end result, at least currently, is the chatbots we’ve seen in the form of Google Bard and ChatGPT.
If you do not want your website’s content used for this training, you can ask the bots deployed by Google and OpenAI to skip over your site. Keep in mind that this only applies to future scraping. If Google or OpenAI already has data from your site, it will not be removed. It also doesn’t stop the countless other companies out there training their own LLMs, and doesn’t affect anything you’ve posted elsewhere, like on social networks or forums. It also wouldn’t stop models that are trained on large data sets of scraped websites that aren’t affiliated with a specific company. For example, OpenAI’s GPT-3 and Meta’s LLaMA were both trained using data mostly collected from Common Crawl, an open source archive of large portions of the internet that is routinely used for important research. You can block Common Crawl, but doing so blocks the web crawler from using your data in all its data sets, many of which have nothing to do with AI.
There’s no technical requirement that a bot obey your requests. Currently, only Google and OpenAI have announced that this is the way to opt out, so other AI companies may not care about this at all, or may add their own directions for opting out. But it also doesn’t block any other types of scraping that are used for research or for other means, so if you’re generally in favor of scraping but uneasy with the use of your website content in a corporation’s AI training set, this is one step you can take.
Before we get to the how, we need to explain what exactly you’ll be editing to do this.
What’s a Robots.txt?
In order to ask these companies not to scrape your site, you need to edit (or create) a file located on your website called “robots.txt.” A robots.txt is a set of instructions for bots and web crawlers. Up until this point, it was mostly used to provide useful information for search engines as their bots scraped the web. If website owners want to ask a specific search engine or other bot to not scan their site, they can enter that in their robots.txt file. Bots can always choose to ignore this, but many crawling services respect the request.
This might all sound rather technical, but it’s really nothing more than a small text file located in the root folder of your site, like “https://www.example.com/robots.txt.” Anyone can see this file on any website. For example, here’s The New York Times’ robots.txt, which currently blocks both ChatGPT and Bard.
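As a rough illustration of the format, a robots.txt is made up of groups: a User-agent line naming the bot, followed by one or more rules for that bot. The folder name and the bot name “ExampleBot” below are hypothetical placeholders, not taken from any real site:

```text
# Applies to any crawler that has no more specific group of its own
User-agent: *
Disallow: /drafts/

# Applies only to a crawler that identifies itself as ExampleBot
User-agent: ExampleBot
Disallow: /
```

Lines beginning with # are comments and are ignored by crawlers.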
If you run your own website, you should have some way to access the file structure of that site, either through your hosting provider’s web portal or FTP. You may need to comb through your provider’s documentation for help figuring out how to access this folder. In most cases, your site will already have a robots.txt created, even if it’s blank, but if you do need to create a file, you can do so with any plain text editor. Google has guidance for doing so here.
EFF will not be using these flags because we believe scraping is a powerful tool for research and access to information.
What to Include In Your Robots.txt to Block ChatGPT and Google Bard
With all that out of the way, here’s what to include in your site’s robots.txt file if you do not want ChatGPT and Google to use the contents of your site to train their generative AI models. If you want to cover the entirety of your site, add these lines to your robots.txt file:
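Assuming the user-agent names the two companies have published for this purpose (“GPTBot” for OpenAI’s crawler and “Google-Extended” for Google’s AI training control), the site-wide entries would look something like this. Check each company’s current documentation before relying on these names, since they may change:

```text
# Ask OpenAI's crawler to skip the entire site
User-agent: GPTBot
Disallow: /

# Ask Google not to use the site for generative AI training
User-agent: Google-Extended
Disallow: /
```

Because the second group targets only Google-Extended, it should not affect how Google’s regular search crawler indexes your site.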
You can also narrow this down to block access to only certain folders on your site. For example, maybe you don’t mind if most of the data on your site is used for training, but you have a blog that you use as a journal. You can opt out specific folders. For example, if the blog is located at yoursite.com/blog, you’d use this:
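A sketch of the folder-specific version, again assuming the “GPTBot” and “Google-Extended” user-agent names and the example /blog location mentioned above:

```text
# Ask OpenAI's crawler to skip only the blog
User-agent: GPTBot
Disallow: /blog

# Ask Google not to use the blog for generative AI training
User-agent: Google-Extended
Disallow: /blog
```

Rules in robots.txt match by path prefix, so “Disallow: /blog” would also cover a path like /blog-archive. If you only want to cover what is inside the folder, use “Disallow: /blog/” instead.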
As mentioned above, we at EFF will not be using these flags because we believe scraping is a powerful tool for research and access to information; we want the information we’re providing to spread far and wide and to be represented in the outputs and answers provided by LLMs. Of course, individual website owners have different views for their blogs, portfolios, or whatever else they use their websites for. We’re in favor of means for people to express their preferences, and it would ease many minds for other companies with similar AI products, like Anthropic, Amazon, and countless others, to announce that they’d respect similar requests.