Date: 2023-08-08

OpenAI on August 7 updated its documentation page explaining how you can restrict its web crawler, GPTBot, from crawling your site to train the company’s artificial intelligence (AI) models, including GPT, the AI model behind the popular ChatGPT.

Why does this matter: AI models are largely trained on huge datasets that include data scraped from the internet. But in many cases, especially where copyright is a concern, a website owner might not want a bot crawling their site. For example, a news website might object because if an AI bot can scrape its articles and present them to users without attribution, users have no reason to visit the site, which in turn reduces its ad revenue. In another example, Twitter recently limited the number of tweets users can see in a day to prevent AI bots from scraping data on its platform. To address these concerns, OpenAI is giving website owners more control over what its web crawler can access.

How to disallow GPTBot from crawling your site:

To completely disallow GPTBot from accessing your site, you can add the following to your site’s robots.txt file:

User-agent: GPTBot
Disallow: /

You can also choose to allow GPTBot to access only parts of your site by adding limitations in the robots.txt file like this:

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/

A robots.txt file is a file on a website that tells search engines and other crawlers which parts of the site they can and cannot access. Compliance is voluntary: while well-behaved bots follow the instructions, others (bad actors) might not.
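
One way to sanity-check rules like the ones shown earlier before deploying them is to run them through a robots.txt parser. The sketch below uses Python's standard-library urllib.robotparser. The example.com URLs, the "SomeOtherBot" name, and the is_allowed helper are placeholders chosen for illustration, and this parser's matching may differ in detail from how GPTBot itself interprets the file.

```python
from urllib.robotparser import RobotFileParser

# The two rule sets from the article, written as they would appear in robots.txt.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""

PARTIAL = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if this parser would let user_agent fetch url.

    Hypothetical helper for illustration; it parses the rules from a
    string instead of downloading them from a live site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        (BLOCK_ALL, "GPTBot", "https://example.com/news/story"),
        (BLOCK_ALL, "SomeOtherBot", "https://example.com/news/story"),
        (PARTIAL, "GPTBot", "https://example.com/directory-1/page"),
        (PARTIAL, "GPTBot", "https://example.com/directory-2/page"),
    ]
    for rules, agent, url in checks:
        print(agent, url, "allowed:", is_allowed(rules, agent, url))
```

A check like this only confirms what a compliant crawler would do with the file. It says nothing about bots that ignore robots.txt altogether.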

You can also block the IP addresses used by OpenAI’s crawler, which the company lists in its documentation.
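
As a rough sketch of what IP-based blocking involves, the snippet below checks whether a client address falls inside a list of CIDR ranges using Python's standard-library ipaddress module. The ranges shown are reserved documentation blocks used purely as placeholders, not OpenAI's actual ranges, and the function names are made up for this example. In practice this kind of rule usually lives in a firewall or web server configuration.

```python
import ipaddress

# Placeholder ranges from reserved documentation address blocks.
# These are NOT OpenAI's ranges; substitute the list the company publishes.
PLACEHOLDER_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def build_blocklist(cidrs):
    """Parse CIDR strings into network objects once, up front."""
    return [ipaddress.ip_network(cidr) for cidr in cidrs]


def is_blocked(client_ip, blocklist):
    """Return True if client_ip belongs to any network in blocklist."""
    addr = ipaddress.ip_address(client_ip)
    # Compare only against networks of the same IP version (IPv4 or IPv6).
    return any(addr.version == net.version and addr in net for net in blocklist)


if __name__ == "__main__":
    blocklist = build_blocklist(PLACEHOLDER_RANGES)
    for ip in ["192.0.2.17", "198.51.100.5", "2001:db8::1"]:
        print(ip, "blocked:", is_blocked(ip, blocklist))
```

Unlike robots.txt, a block of this kind does not rely on the crawler's cooperation, but it only works as long as the published address list is kept up to date.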

In addition to the above safeguards that you can manually enable, web pages crawled with GPTBot “are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies,” OpenAI said.

“Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety,” the company added.

What about content that has been scraped previously? The method outlined above will stop OpenAI’s crawler going forward, but what about data already scraped from sites that now disallow crawling? Is there any way for a website to request that its data be deleted from OpenAI’s existing datasets?

