- Processing non-personal data poses reidentification and group privacy risks
- Different data holders follow different anonymisation standards
- Dark metadata aggregators can deanonymise by collating data from various sources and using identifiers
- Privacy harms of non-personal data can be addressed through existing laws such as the IT Act
“Data cannot be either personally identifiable information (PII) or non-PII. That is completely invalid in practice. It is a spectrum from 0.0 to 1.0 in terms of what’s the harm that it could create, and it’s exceptionally based on context,” Anand Venkatanarayanan from Hasgeek said while explaining how a group of non-PII data points can be clustered to make the information personally identifiable.
Venkatanarayanan was speaking at MediaNama’s ‘Regulating Non-Personal Data’ event held on February 18, 2022. The panel on ‘Privacy and NPD’ also included Amlan Mohanty from Google, Amol Kulkarni from CUTS International, and Digvijay Chaudhary from Centre for Internet and Society, with Smriti Parsheera from CyberBRICS as the moderator.
The non-personal data (NPD) framework proposed by the expert committee set up by the government mandates the sharing of non-personal data at an aggregate level for public value, among other recommendations.
This event was organised with support from Google, PhonePe, Amazon, Meta, and Microsoft.
What are the risks regarding processing of non-personal data?
Chaudhary explained that there are two types of risks regarding processing of non-personal data:
- Reidentification: “So in our data protection laws, the underlying assumption is that they consider the nature of data as stable. However, in reality the nature of data is more volatile. Therefore segregating data on the basis of definitions is extremely difficult. Especially considering the processing power of big data machine learning, and how big data upsets such regulatory mechanisms…” Chaudhary said.
- Group harm: Chaudhary said group/collective risk is another privacy harm associated with NPD. He said, “The only bright line in the NPD expert committee report was that it recognised group privacy as a concept.” The draft NPD report notes, “Collective privacy refers to possibilities of collective harm related to Non-Personal Data about a group or community that may arise from inappropriate exposure or handling of such data.”
How does free flow of data hamper anonymisation?
“Primarily, anonymisations as a technical construct have failed. The primary reason is free flow of data everywhere.” – Anand Venkatanarayanan
Data is not a limited commodity: “When data is not a limited commodity and it is free flowing everywhere, it makes it very easy for one to develop deanonymisation solution,” Venkatanarayanan said.
Data can be collated through various sources: He explained how dark data markets subvert anonymisation by collating data from various sources. “For instance, they will get data from clubs, gyms, leaked payment companies, so what you get is everyone has different anonymisation standards. And everyone has a different way of doing it. You can’t go enforce a standard on every piece of data, by saying that this is the way you have to anonymise,” Venkatanarayanan said.
Data can be linked using identifiers: Venkatanarayanan said that a dark trade aggregator, even if they don’t have direct access to data, will get primary and tertiary anonymised data and then try to link it using identifiers. He said that there are two privacy harms associated with this:
- A person is correctly identified by collating anonymised data
- A person is incorrectly identified
“Now, let us say there is a company, which is dealing with health data. It tells you that in a certain PIN code they found 5000 males who have a particular problem and 4000 females which have a particular issue. Would you call it non personal (data)? Probably yes. Then you go back and leak a single piece of data somewhere else — which says that I live in PIN code number 43 and I’m a male and I’m like 45 years old. What is the probability that you are one of those people who have the disease?…this is privacy harm.” — Anand Venkatanarayanan
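The inference in the quote above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cell populations, the record fields, and the `inferred_risk` helper are all hypothetical; only the 5000 and 4000 counts and PIN code 43 come from the example in the quote.

```python
# Hypothetical aggregate release: per (PIN code, sex) cell, how many people
# have the condition. The population figures are invented for illustration.
aggregate_release = {
    ("43", "M"): {"population": 20000, "with_condition": 5000},
    ("43", "F"): {"population": 21000, "with_condition": 4000},
}

# Hypothetical record leaked from an unrelated source.
leaked_record = {"name": "Person A", "pin_code": "43", "sex": "M", "age": 45}


def inferred_risk(record, release):
    """Return the share of the record's cell that has the condition, or None."""
    cell = release.get((record["pin_code"], record["sex"]))
    if cell is None or cell["population"] == 0:
        return None
    return cell["with_condition"] / cell["population"]


risk = inferred_risk(leaked_record, aggregate_release)
print(f"{leaked_record['name']}: inferred probability {risk:.2f}")
```

Neither dataset names a person with the condition, yet joining them on PIN code and sex attaches a probability to a named individual. The inference can also be wrong for that individual, which is the second harm listed above.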
Mapping and mixing of data: Ultimately, de-anonymisation is simply about solving linear equations of ‘n’ variables, Venkatanarayanan said. He illustrated the point with an AI model that someone had developed using credit card data. The model was supposed to predict the likelihood and probable time of loan repayment. But academics were able to recover much of the credit card data, except the last four digits, by reverse engineering the model. “This is unheard of, like you’re talking about a model from which PI is extracted. Where is anonymization in this world?” he asked.
Venkatanarayanan made these other points about non-personal data:
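The "linear equations" remark can be illustrated with a toy example. This is a sketch under assumptions, not the model-extraction attack on the credit card model: the names, values, and the `aggregate_sum` service are hypothetical. The attacker sees only group totals, yet three overlapping totals pin down three individual values.

```python
# Hypothetical private values; the attacker never reads this dict directly.
_private = {"A": 52, "B": 61, "C": 47}


def aggregate_sum(names):
    """Stand-in for a service that only answers sum queries over groups."""
    return sum(_private[n] for n in names)


# Three aggregate answers give three linear equations in three unknowns:
#   a + b = s_ab,   b + c = s_bc,   a + c = s_ac
s_ab = aggregate_sum(["A", "B"])
s_bc = aggregate_sum(["B", "C"])
s_ac = aggregate_sum(["A", "C"])

# Solve by elimination: (a + b) + (a + c) - (b + c) = 2a.
a = (s_ab + s_ac - s_bc) / 2
b = s_ab - a
c = s_ac - a

recovered = {"A": a, "B": b, "C": c}
print(recovered)
```

With more people the system simply has more unknowns; enough overlapping aggregates still determine each individual value.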
- Classification of data is a problem: He said that a problem in privacy practice is that of classification of data. “You should know all the data you have, otherwise how can you protect it?” he asked. Companies are also supposed to figure out how data is flowing across different parts of the organisation, he said, adding that it becomes much harder considering that companies share data with outside vendors such as email providers like MailChimp, SMS providers, etc.
- Data cataloguing or classification is very difficult: At a talk curated by Venkatanarayanan, social media platform LinkedIn reportedly said that building its data catalogue took three years, $18 million, additional infrastructure costs, and a 40-member engineering team.
Is regulating non-personal data premature?
PDP Bill addresses situations where identity of a person is zeroed in through multiple indicators: Mohanty was of the opinion that privacy harms posed by processing non-personal data can be addressed through the Personal Data Protection Bill and the existing IT Act. “A personal data protection legislation recognises a situation where somebody takes all of those data points together and is able to identify a person. So any privacy risk or harm that flows from a situation like that is going to be addressed by the Personal Data Protection bill. You do not need a separate non-personal data bill to address that privacy,” he said.
Misconceptions about personal and non-personal data in the same bill: “The JPC seems to believe that there are privacy risks with non-personal data so that is why it needs to be regulated in the personal data protection bill. That is not the case,” Mohanty said.
He further reasoned that:
- By definition, non-personal data does not have any personal information.
- There is a taxonomy of data classification in privacy law which flows from sensitivity of the data involved. However, when the data is anonymised and aggregated, the sensitivity does not carry forward.
“Clearly, the Personal Data Protection Bill has a very clear conception of harm. It’s talking about bodily, mental injury, it’s talking about discrimination, it’s talking about loss of property, we really need to go into that, step by step and look at what are some of those harms that might relate to collectives, to communities, to groups of people.” — Amlan Mohanty
Data Protection Bill misses out on re-identification of anonymised data: Chaudhary said, “If you see, the PDP Bill only mentions and regulates the re-identification of de-identified data. It doesn’t regulate the re-identification of anonymised data and that is something which I think I miss when the government says that they have included non-personal data to afford privacy protection. This is a grave concern.”
How does Google ensure privacy while sharing data externally?
Mohanty also described the following safeguards that apply when citizens’ NPD is accessed:
Opt-in/opt-out: “For instance, unless you opt into sharing location data, your data is not going to form part of the aggregated data set from which these insights are drawn,” he said, in reference to Google’s mobility reports.
Differential privacy: Mohanty described it as a method in which noise is added to a piece of data to obfuscate the identity of a person.
Google discarded types of datasets where reidentification was possible: Mohanty said Google’s mobility reports are very specific to a particular job, geographic location. “What we decided to do was really discard certain types of datasets where we felt maybe perhaps it is a risk of re-identification. So, for example, a geographic region less than three square kilometers is not included in this dataset because there might be a risk of re-identification,” he added.
Contact tracing system deployed during Covid-19: Google and Apple developed an exposure notification system that notified people who had come into contact with the virus. It used Bluetooth, and the idea was to help governments and government officials with contact tracing, Mohanty said. “So Version 2 of the system is actually about ensuring that privacy is maintained, you maintain the same technology, but through again different technical technologies, including differential privacy, you’re able to share, again, aggregated data sets about how people are coming in contact with each other sharing Bluetooth beacons, being able to identify kind of what actions are being taken to be able to, again, inform public health actions,” he added.
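Two of the safeguards described above, adding noise to data and discarding regions smaller than three square kilometres, can be sketched together. This is a minimal illustration under stated assumptions, not Google's implementation: the region names, visit counts, `epsilon` value, and the `release` and `laplace_noise` helpers are hypothetical, and the noise is the standard Laplace mechanism applied to counts, assuming each person contributes at most one visit per region.

```python
import random

MIN_AREA_SQ_KM = 3.0  # threshold quoted in the text; smaller regions are dropped


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def release(regions, epsilon, rng):
    """Drop small regions, then add Laplace noise to each remaining count."""
    published = {}
    for name, info in regions.items():
        if info["area_sq_km"] < MIN_AREA_SQ_KM:
            continue  # too small: treated as a re-identification risk
        # Assumption: one person changes a count by at most 1,
        # so the noise scale is 1 / epsilon.
        published[name] = info["visits"] + laplace_noise(1 / epsilon, rng)
    return published


# Hypothetical regions and visit counts.
regions = {
    "Region A": {"area_sq_km": 12.5, "visits": 1000},
    "Region B": {"area_sq_km": 2.1, "visits": 40},
    "Region C": {"area_sq_km": 3.0, "visits": 250},
}

published = release(regions, epsilon=0.5, rng=random.Random(0))
for name, value in published.items():
    print(name, round(value, 1))
```

A smaller `epsilon` adds more noise and gives stronger protection at the cost of accuracy. The area threshold is a separate, coarser safeguard that removes risky cells entirely.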