Here’s how web publishers can opt out of Google crawlers scraping website data to train AI models

Google announced the ‘Google-Extended’ tool to help publishers prevent their data from being used to train Google’s AI models.

What’s the news: Google announced a new opt-out control for web publishers who want to decide how their content is used to train generative AI. The control, called ‘Google-Extended’, is a “standalone product token” that lets publishers manage whether their sites help improve the Bard and Vertex AI generative APIs, including future generations of the models that power those products. According to The Verge, while the tool will prevent the data from being used to train AI, crawlers like Googlebot can still scrape and index websites.

“By using Google-Extended to control access to content on a site, a website administrator can choose whether to help these AI models become more accurate and capable over time,” said Google.

Why it matters: For web publishers, data scraping by AI models raises copyright concerns, such as loss of due credit and loss of audience. Platforms like Twitter are limiting access to their content to prevent such scraping, even while using data collected on the platform to train their own AI models. Meanwhile, owners of AI models have started offering opt-out options to publishers. A recent example is OpenAI, which explained how a site owner can restrict its web crawler, GPTBot, from crawling a website. Google is now offering a similar tool, a welcome step towards resolving conflicts between AI model owners and web publishers.

Here’s how Google-Extended works: Google explains how a publisher can change their robots.txt file to keep Google’s AI models in check. A robots.txt file tells search engines and other crawlers which parts of a site they can and cannot access. Compliance is voluntary: well-behaved bots follow the instructions, but others (bad actors) might not.

Here, the user-agent token that publishers have to include in Google’s case is “Google-Extended.” So, websites that don’t want their content used by Google for AI training must include the following text in their robots.txt file:

User-agent: Google-Extended
Disallow: /
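
For publishers who want to sanity-check the rule before deploying it, here is a minimal sketch using Python's standard-library urllib.robotparser. It keeps the two directives on consecutive lines and checks that the rule blocks the Google-Extended token while leaving Googlebot unaffected. The helper name can_fetch and the example.com URL are assumptions for illustration, and Python's parser only approximates how Google itself interprets robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The rule from the article, with both directives on consecutive lines.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under `robots_txt`.

    Hypothetical helper: it wraps Python's stdlib parser, which is an
    approximation of, not a substitute for, Google's own parser.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/some-article"  # placeholder URL
    for agent in ("Google-Extended", "Googlebot"):
        verdict = "allowed" if can_fetch(ROBOTS_TXT, agent, url) else "blocked"
        print(f"{agent}: {verdict}")
```

Under this rule the sketch reports Google-Extended as blocked and Googlebot as allowed, which matches The Verge's point that search crawling and indexing continue.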

Control tools allow better transparency: As per Google’s blog, making controls like Google-Extended available through robots.txt provides the kind of transparency and control that all providers of AI models should make available. However, it also warned that web publishers will face increasing complexity in managing different uses at scale as AI applications expand.

“We’re committed to engaging with the web and AI communities to explore additional machine-readable approaches to choice and control for web publishers,” said Google.






MediaNama is the premier source of information and analysis on Technology Policy in India. More about MediaNama, and contact information, here.

© 2008-2021 Mixed Bag Media Pvt. Ltd. Developed By PixelVJ
