Microsoft takes down world’s largest facial recognition data set amid privacy concerns

Microsoft quietly deleted a data set of more than 10 million images, intended as a test and training data set for facial recognition algorithms, according to a report (paywall) by the Financial Times. The database, dubbed MS Celeb, was the largest public facial recognition data set in the world, and contained images of more than 100,000 people, largely scraped from publicly available online sources. Uncovered by Berlin-based researcher Adam Harvey, it was reportedly being used by companies to test their facial recognition software. The takedown came after a Financial Times investigation (paywall) found that many of the people in the database were not aware that they were in it, and had not consented to having their pictures used.

In a statement to the FT, Microsoft tried to downplay the controversy, saying the database was only for “academic purposes” and was run by an employee who “no longer works for the company”. The FT report noted that the MS Celeb database is still available to any academic institution or company that had previously downloaded it, and is still being shared on GitHub, Dropbox, and Baidu Cloud. During his investigation, Harvey also discovered very similar databases hosted by researchers at Duke and Stanford universities; these have since been taken down.

Microsoft is not the only company to have assembled a large data set by scraping photos from the open Internet. In January, IBM announced it was sharing a collection of 1 million publicly available faces to “study the fairness and accuracy in facial recognition technology”. It said the data set was “available to the global research community upon request”.

We believe by extracting and releasing these facial coding scheme annotations on a large dataset of 1 million images of faces, we will accelerate the study of diversity and coverage of data for AI facial recognition systems to ensure more fair and accurate AI systems. Today’s release is simply the first step.


What data did MS Celeb contain?

The database contained more than 10 million images of over 100,000 people, acquired from websites like Flickr, where a significant number of images are hosted under the Creative Commons licence, that is, they can be used free of copyright concerns. It acquired the name MS Celeb because a large number of the images in the data set were of celebrities and other public figures. However, the data set also included images of security journalists, privacy researchers and authors. Problematically, many of them had not given consent for their images to be used in this way.

The FT report said that data from MS Celeb has been used by major commercial organisations such as IBM, Panasonic, Alibaba, Nvidia, Hitachi, Sensetime and Megvii, among others. Interestingly, Sensetime and Megvii are Chinese suppliers of equipment to officials in Xinjiang, where facial recognition technology is being used to track and control the Uighurs, a primarily Muslim minority in China.

What the GDPR says about user consent

The biggest issue with the MS Celeb database is the fact that it stored information about people who may not have given their consent. Under the EU’s General Data Protection Regulation, consent must be informed and specific. For this, the data subject must be notified about the controller’s identity, what kind of data will be processed, how it will be used and the purpose of the processing operations as a safeguard against “function creep”. The law also states that consent must be “unambiguous”, that is, it requires either a statement or a clear affirmative act.

Ironically, Microsoft President Brad Smith revealed last October that the firm had turned down a request by a law enforcement agency in California, which wanted it to install facial recognition technology on officers’ cars and bodies, as it posed significant “human rights concerns”. Smith had also beseeched the US Congress to regulate facial recognition technology in a blog post last year, saying it has “broad societal ramifications and potential for abuse”.

Written By

MediaNama’s mission is to help build a digital ecosystem which is open, fair, global and competitive.




MediaNama is the premier source of information and analysis on Technology Policy in India. More about MediaNama, and contact information, here.

© 2008-2021 Mixed Bag Media Pvt. Ltd. Developed By PixelVJ
