Microsoft has quietly deleted a data set of more than 10 million images, intended as a test and training data set for facial recognition algorithms, according to a report (paywall) by the Financial Times. The database, dubbed MS Celeb, was the largest public facial recognition data set in the world, and contained images of more than 100,000 people, largely scraped from publicly available online sources. Uncovered by Berlin-based researcher Adam Harvey, it was reportedly being used by companies to test their facial recognition software. The takedown came after a Financial Times investigation (paywall) found that many of the people in the database were not aware that they were in it, and had not consented to having their pictures used.

In a statement to the FT, Microsoft tried to downplay the controversy, saying the database was only for “academic purposes” and was run by an employee who “no longer works for the company”. The FT report noted that the MS Celeb database is still available to any academic institution or company that had previously downloaded it, and is still being shared on GitHub, Dropbox, and Baidu Cloud. During his investigation, Harvey also discovered very similar databases hosted by researchers at Duke and Stanford universities; these have since been taken down.

Microsoft is not the only company to have assembled a large data set by scraping photos from the open internet. In January, IBM announced it was sharing a collection of 1 million publicly available faces to “study the fairness and accuracy in facial recognition technology”. It said the data set was “available to the global research community upon request”.

“We believe by extracting and releasing these facial coding scheme annotations on a large dataset of 1 million images of faces, we will accelerate the study of diversity and coverage of data for AI facial recognition systems to ensure more fair and accurate AI systems. Today’s release is simply the first step,” IBM said in its announcement.


What data did MS Celeb contain?

The database contained more than 10 million images of over 100,000 people, acquired from websites like Flickr, where a significant number of images are hosted under Creative Commons licences that permit reuse under the licence’s terms. It acquired the name MS Celeb because a large number of the images in the data set were of celebrities and other public figures. However, the data set also included images of security journalists, privacy researchers and authors. Problematically, many of them had not given consent for their images to be used in this way.

The FT report said that data from MS Celeb has been used by major commercial organisations such as IBM, Panasonic, Alibaba, Nvidia, Hitachi, Sensetime and Megvii, among others. Interestingly, Sensetime and Megvii are Chinese suppliers of equipment to officials in Xinjiang, where facial recognition technology is being used to track and control the Uighurs, a primarily Muslim minority in China.

What the GDPR says about user consent

The biggest issue with the MS Celeb database is that it stored information about people who may not have given their consent. Under the EU’s General Data Protection Regulation, consent must be informed and specific. For this, the data subject must be notified about the controller’s identity, what kind of data will be processed, how it will be used, and the purpose of the processing operations, as a safeguard against “function creep”. The law also states that consent must be “unambiguous”, that is, it requires either a statement or a clear affirmative act.

Ironically, Microsoft President Brad Smith revealed last October that the firm had turned down a request by a law enforcement agency in California, which wanted it to install facial recognition technology in officers’ cars and body cameras, because it posed significant “human rights concerns”. In a blog post last year, Smith had also urged the US Congress to regulate facial recognition technology, saying it has “broad societal ramifications and potential for abuse”.