What’s the news: Google announced a new opt-out control for web publishers who want a say in how their content is used to train generative AI. The control, called ‘Google-Extended’, is a “standalone product token” that lets publishers manage whether their sites help improve Bard and Vertex AI generative APIs, including future generations of the models that power those products. According to The Verge, while the tool will prevent the data from being used to train AI, crawlers like Googlebot can still scrape and index websites.
“By using Google-Extended to control access to content on a site, a website administrator can choose whether to help these AI models become more accurate and capable over time,” said Google.
Why it matters: For web publishers, data scraping by AI models raises copyright concerns such as loss of due credit or loss of audience. Even as platforms like Twitter limit access to their content to prevent such scraping, while using data collected on the platform to train their own AI models, owners of these models are now coming forward with opt-out options for publishers as well. The most recent example is OpenAI explaining how a site owner can restrict its web crawler GPTBot from crawling a website. Now, Google is also offering such a tool, a welcome step towards resolving conflicts between AI model owners and web publishers.
Here’s how Google-Extended works: Google explains how a publisher can change their robots.txt file to keep Google’s AI models in check. A robots.txt file tells search engines and other crawlers which parts of a site they can and cannot access. Compliance is voluntary: while some bots follow the instructions, others (bad actors) might not.
In Google’s case, the user agent token that publishers have to include is “Google-Extended.” So, websites that don’t want their content used by Google for AI training must include the following text in their robots.txt file:
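A rule of this kind, assuming the publisher wants to opt the whole site out, pairs the Google-Extended token with a blanket `Disallow: /`. The Python sketch below holds that assumed rule in `ROBOTS_TXT` and feeds it to the standard library’s `urllib.robotparser` to illustrate the effect described above: the Google-Extended token is refused, while Googlebot, which the rule does not name, is still allowed. The helper `is_allowed` and the `example.com` URL are placeholders for illustration, not anything Google publishes.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule: pair the Google-Extended token with a site-wide Disallow.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"  # placeholder URL
    for token in ("Google-Extended", "Googlebot"):
        print(token, is_allowed(ROBOTS_TXT, token, page))
    # prints:
    # Google-Extended False
    # Googlebot True
```

This only models how a standards-following parser matches the token name; whether a given crawler actually honours the file is up to that crawler.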
Control tools allow better transparency: As per Google’s blog, the Google-Extended control, available through robots.txt, provides the kind of transparency and control that Google believes all providers of AI models should make available. However, Google also warned that web publishers will face increasing complexity in managing different uses at scale as AI applications expand.
“We’re committed to engaging with the web and AI communities to explore additional machine-readable approaches to choice and control for web publishers,” said Google.