The Massachusetts Institute of Technology (MIT) has taken down a highly cited dataset, created in 2006, containing over 80 million labelled images sourced from Google, after two researchers discovered thousands of images labelled with racist slurs for Black and Asian people and derogatory terms used to describe women.
The dataset was removed in early July after The Register, an online tech news publication, alerted the institute. “The key problem is that the dataset includes, for example, pictures of Black people and monkeys labelled with the N-word; women in bikinis, or holding their children, labelled the W-word; parts of the anatomy labelled with crude terms, and so on – needlessly linking everyday imagery to slurs and offensive language, and baking prejudice and bias into future AI models,” The Register reported.
MIT was swift in withdrawing the dataset, releasing an official statement and apology: “It has been brought to our attention that the Tiny Images dataset contains some derogatory terms as categories and offensive images. This was a consequence of the automated data collection procedure that relied on nouns from WordNet. We are greatly concerned by this and apologise to those who may have been affected.
The dataset is too large (80 million images), and the images are so small (32 x 32 pixels), that it can be difficult for people to visually recognise its content. Therefore, manual inspection, even if feasible, will not guarantee that offensive images can be completely removed.” The institute also urged the research and development community to delete all copies of the dataset and to stop using it for training future AI models.
Vinay Prabhu, chief scientist at UnifyID, a privacy startup in Silicon Valley, and Abeba Birhane, a PhD candidate at University College Dublin in Ireland, were the researchers who examined the 80-million-image database and uncovered the racist slurs and derogatory labels. Their findings were published on the pre-print server arXiv, and the paper has also been submitted to a computer-vision conference due to be held next year. “The absence of critical engagement with canonical datasets disproportionately negatively impacts women, racial and ethnic minorities, and vulnerable individuals and communities at the margins of society,” the researchers wrote in their paper.
The dataset was previously used to develop advanced object-detection techniques that let machine-learning models automatically identify people, objects and activities depicted in still images. Applications, websites and other products built on neural networks trained with the dataset may therefore have absorbed its offensive labels. Until its removal, the dataset served as a benchmark for computer-vision algorithms.