Here’s how you can block OpenAI’s web crawler from scraping your site

AI models, such as OpenAI’s ChatGPT, are trained using large datasets, including data scraped from the internet using crawlers. This, however, can also lead to copyright issues for websites.

On August 7, OpenAI updated its documentation to explain how you can stop its web crawler, GPTBot, from crawling your site to collect training data for the company’s artificial intelligence (AI) models, including GPT, the model behind the popular ChatGPT.

Why this matters: AI models are largely trained on huge datasets, which include data scraped from the internet. But in many cases, especially where copyright is a concern, a website owner might not want a bot crawling their site. For example, a news website might object because if an AI bot can scrape its articles and present them to a user without attribution, the user has no need to visit the news website, which in turn reduces the site’s ad revenue. In another example, Twitter recently limited the number of tweets users can see in a day to prevent AI bots from scraping data on its platform. To address these concerns, OpenAI is giving website owners more control over what its web crawler can access.

How to disallow GPTBot from crawling your site:

To completely disallow GPTBot from accessing your site, you can add the following to your site’s robots.txt file:

User-agent: GPTBot
Disallow: /

You can also choose to allow GPTBot to access only parts of your site by adding limitations in the robots.txt file like this:

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/

A robots.txt file is a file on a website that tells search engines and other crawlers which parts of the site they can and cannot access. Compliance is voluntary: well-behaved bots follow the instructions, but others (bad actors) might ignore them.
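To see how a robots.txt-aware crawler would read the two rule sets above, here is a minimal sketch using Python’s standard urllib.robotparser module. The helper name is_allowed and the example.com URLs are placeholders chosen for illustration, and Python’s parser is only one interpretation of the rules; it does not show how GPTBot itself evaluates them.

```python
from urllib.robotparser import RobotFileParser

# The two rule sets quoted in the article.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""

PARTIAL = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text.

    is_allowed is a hypothetical helper name, not part of any OpenAI tooling.
    """
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    site = "https://example.com"  # placeholder domain
    for name, rules in (("block all", BLOCK_ALL), ("partial", PARTIAL)):
        for path in ("/directory-1/post", "/directory-2/post"):
            print(name, path, is_allowed(rules, "GPTBot", site + path))
```

Because the rules name only GPTBot, a crawler with a different user agent is unaffected by either rule set.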

You can also block the IP addresses used by OpenAI’s crawler, which OpenAI publishes.
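Blocking by IP address is normally done in a firewall or web server configuration, but the underlying check is simple: does the visitor’s address fall inside one of the published ranges? The sketch below illustrates that check with Python’s standard ipaddress module. The ranges shown are reserved documentation addresses used as placeholders, not OpenAI’s actual ranges, and the names EXAMPLE_CRAWLER_RANGES and is_crawler_ip are hypothetical.

```python
import ipaddress
from typing import Iterable

# Placeholder ranges taken from the address blocks reserved for documentation.
# These are NOT OpenAI's real ranges; substitute the list OpenAI publishes.
EXAMPLE_CRAWLER_RANGES = ["192.0.2.0/24", "198.51.100.0/28"]


def is_crawler_ip(ip: str, cidr_ranges: Iterable[str]) -> bool:
    """Return True if ip falls inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidr_ranges)


if __name__ == "__main__":
    for ip in ("192.0.2.10", "198.51.100.20", "203.0.113.5"):
        verdict = "block" if is_crawler_ip(ip, EXAMPLE_CRAWLER_RANGES) else "allow"
        print(ip, verdict)
```

Published ranges can change over time, so a list like this would need to be refreshed periodically rather than hard-coded.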

In addition to the above safeguards that you can manually enable, web pages crawled by GPTBot “are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies,” OpenAI said.

“Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety,” the company added.

What about content that has been scraped previously? While the method outlined above to disallow OpenAI’s crawler will work going forward, what about existing data scraped from sites that now disallow crawling? Is there any way for a website to request its data to be deleted from OpenAI’s existing datasets?

© 2008-2021 Mixed Bag Media Pvt. Ltd. Developed By PixelVJ
