Views: Can generative AI collect our data from the Internet?

Is it safe to consider all “publicly available data” as public?

By Sreenidhi Srinivasan and Pallavi Sondhi 

ChatGPT can write sonnets, code websites, and even pass the bar exam. It learned how to do this by training on huge amounts of data. A lot of this data is personal information about individuals scraped from the Internet, often without their knowledge.

Catching on to this, Italy’s data protection regulator last month stopped ChatGPT’s operations over a breach of its data protection norms.

India is still finalising its data protection law. Against the backdrop of Italy’s action, we discuss how ChatGPT would fare under India’s proposed law, and whether there are lessons for us to draw from this episode.

ChatGPT under the scanner across the EU

Italy’s ban on ChatGPT was prompted by a few reasons:

  • There was no legal basis to justify the massive collection of data to train ChatGPT’s algorithms.
  • OpenAI did not have appropriate age-gating mechanisms to ensure that children’s data was not collected to train algorithms.
  • The company didn’t give people adequate notice before collecting their data.
  • ChatGPT gave out factually incorrect information.

Italy had also earlier restricted “Replika”, an AI-powered chatbot, on similar grounds. Taking a cue from Italy, regulators in Germany, Spain, France, and Ireland are exploring action of their own.

Italy has now asked OpenAI to abide by certain norms for the ban to be lifted. OpenAI must publish information about its data processing and must clarify the legal basis for processing personal data to train its AI. It must allow users to seek correction or deletion of inaccurate data, and allow users to object to OpenAI’s use of their personal data to train its algorithms.

While Italy’s approach raises several interesting questions, we focus on one key issue: training AI models using data that’s freely and publicly available. Think public social media profiles, news pieces, Reddit posts, and so on.

Is data from public sources ‘private’? 

ChatGPT’s technical paper says its training data includes “publicly available personal information”. Under EU law, any data that can identify an individual is ‘personal information’. To collect and use such data, a business must meet privacy norms, regardless of whether it’s collected from the individual directly or is available publicly and freely.

Interestingly, under India’s current data protection law (rules under the Information Technology Act), data that is “freely available” or “accessible in public domain” is not considered sensitive data. And so, for collecting and using such publicly available information, you need not abide by data protection rules.

But the draft Digital Personal Data Protection Bill 2022 (India’s current draft data protection law) takes a different position, one that’s similar to the EU approach. Even if you collect data from public sources, if it relates to an identifiable individual, it is ‘personal’. And all the do’s and don’ts that attach to the collection and use of personal data apply to it (with one exception, around deemed consent).

How can data be collected and used to train AI models? 

In the EU, even if a business is collecting or scraping personal information off the Internet, it must still justify its collection and use under one of six legal ‘bases’ set out in the GDPR. User consent is one basis. Another is fulfilling a contract. But the one often relied on for training AI algorithms or improving a product is the “legitimate interests” of a business.

India’s draft law does not, as such, require the data collector to establish a legal basis. However, to collect and use personal data, a platform must get users’ consent or rely on deemed consent. That is, either you get actual consent from individuals, or your collection and use of data falls within one of the ‘deemed consent’ grounds recognised in law, such as processing data to comply with a court order, to respond to a medical emergency or for a public health response, or for ‘reasonable purposes’ recognised by the Indian government.

‘Deemed consent’ may help in training AI 

Taking repeated consent to collect data for training AI models is cumbersome. So developers are likely to consider two “deemed consent” grounds that could be relevant here.

One, under the draft law, consent can be assumed when you are processing “publicly available personal data” in “public interest”. Say a platform scoops up a public Reddit thread where users discuss their worst dating encounters to train its algorithm. Does the AI developer then not need to take users’ consent separately to process this data, since it is publicly available?

Two, consent can be inferred when an individual voluntarily provides her information and can be reasonably expected to do so. For example, a user signs up on Reddit. Reddit’s privacy policy says “Much of the information on the Services is public and accessible to everyone, even without an account. By using the Services, you are directing us to share this information publicly and freely.” Can the user’s catch-all consent to the privacy policy be considered consent to the sharing of their data with AI models like ChatGPT, and to OpenAI’s use of that data for training algorithms?

Interestingly, platforms like Reddit are going to start charging AI developers for access to their content. But the question of consent or deemed consent would remain.

Using data to train AI models: A reasonable purpose?

As India seeks to establish itself as an AI powerhouse, it would be worth exploring if the use of data to train AI models should be a ‘reasonable purpose’ under India’s data protection law. This should be subject, of course, to appropriate checks and balances. For instance, similar to Italy’s guidance, individuals could be allowed the right to object to the use of their personal data for training AI models – an opt-out rather than an opt-in.

Sreenidhi Srinivasan is a Partner and Pallavi Sondhi is a Senior Associate at Ikigai Law.

This post is released under a CC-BY-SA 4.0 license. Please feel free to republish on your site, with attribution and a link. Adaptation and rewriting, though allowed, should be true to the original.

MediaNama is the premier source of information and analysis on Technology Policy in India. More about MediaNama, and contact information, here.

© 2008-2021 Mixed Bag Media Pvt. Ltd. Developed By PixelVJ