The buzzword of the technology world right now is none other than Machine Learning, and for good reason. Companies deploy ML models for purposes including data security, financial trading predictions, healthcare diagnosis, marketing personalisation, fraud detection, recommendation systems and autonomous cars. To perform these tasks, the models are fed large amounts of consumer data.
AI models might sometimes remember specifics about the data they have been trained on and 'leak' these details later. During the pandemic, for example, data collected from people's use of mobile phones, emails, banking, social media and postal services was used to track the spread of the virus. If compromised, however, the same medical data could fuel COVID fraud scams, as has happened in the past.
This is where differential privacy comes in: a framework for assessing and limiting the risk of such data leakage.
Classical and conventional techniques for ensuring privacy, such as hashing, are now outdated. A real-world example: in 2006, the streaming service Netflix held a competition in which contestants were challenged to predict how a user would rate a film based on their previous movie ratings and the kinds of movies they had seen.
The competition had to be cancelled early over privacy concerns. Why? Researchers from the University of Texas were able to successfully re-identify the hashed users. The Netflix team's belief that simply hashing user data would prevent privacy attacks was completely incorrect, as they failed to consider one of the essential threats: the linkage attack, in which an 'anonymised' dataset is cross-referenced with publicly available data (in this case, IMDb ratings) to re-identify individuals.
Due to the over-parameterisation of deep neural networks, Machine Learning models can sometimes inadvertently memorise individual training samples, resulting in undesirable data breaches. In practice, differential privacy is a delicate balancing act between privacy preservation and model utility.
Differential privacy, to be precise, is a method of publicly sharing information about a dataset by describing the patterns of groups within it while intentionally withholding information about individuals, typically by adding carefully calibrated random noise to query results. The technique allows companies to customise the level of privacy (governed by a parameter known as the privacy budget, epsilon) and leave attackers with only partially correct data. As a result, it has some major advantages over traditional anonymisation approaches.
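To make the mechanism concrete, the sketch below shows the classic Laplace mechanism applied to a counting query, one standard way differential privacy is implemented (not the specific method of any vendor named here). The dataset, the predicate and the epsilon values are illustrative assumptions.

```python
import numpy as np

def laplace_count(data, predicate, epsilon):
    """Differentially private counting query via the Laplace mechanism.

    A counting query has sensitivity 1: adding or removing one individual
    changes the true count by at most 1, so noise is drawn from
    Laplace(scale = sensitivity / epsilon).
    """
    true_count = sum(1 for record in data if predicate(record))
    sensitivity = 1.0
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Hypothetical dataset: ages of individuals.
ages = [23, 35, 45, 52, 61, 29, 41, 38, 57, 33]

# Smaller epsilon = stronger privacy = more noise in the released answer.
for epsilon in (0.1, 1.0, 10.0):
    noisy = laplace_count(ages, lambda age: age > 40, epsilon)
    print(f"epsilon={epsilon}: noisy count of people over 40 = {noisy:.2f}")
```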
However, the method has certain limitations. By design, it prevents an analyst from learning details about a specific individual; for instance, it is of little use to banks looking for individual instances of fraudulent activity. Also, while the noise added via DP is often negligible for a large dataset, it can severely distort the analysis of a small one.
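The effect of dataset size can be seen in a quick sketch (the record counts and epsilon below are made-up values): the Laplace noise has a fixed scale of sensitivity/epsilon regardless of how many records are queried, so its relative impact on a count shrinks as the dataset grows.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
epsilon, sensitivity = 1.0, 1.0  # illustrative privacy budget and query sensitivity

for n_records in (100, 10_000, 1_000_000):
    true_count = n_records // 2          # suppose half the records match a query
    noise = rng.laplace(0.0, sensitivity / epsilon)
    relative_error = abs(noise) / true_count
    print(f"n={n_records:>9}: noise={noise:+.2f}, relative error={relative_error:.6%}")
```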
Multiple differential privacy tools from big tech firms have been open-sourced. These include Opacus from Facebook, TensorFlow Privacy from Google, Diffprivlib v0.4 from IBM, PyDP from OpenMined, and OpenDP from Harvard and Microsoft. According to a white paper published by the Simons Institute at the University of California, Berkeley, differential privacy is a viable alternative to traditional anonymisation techniques, and policymakers should collaborate closely with researchers to develop recommendations.
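As a brief illustration of how such libraries are used, the sketch below computes a differentially private mean with IBM's Diffprivlib; the salary data, bounds and epsilon are assumptions for demonstration, and the exact parameter names may vary between library versions.

```python
import numpy as np
from diffprivlib import tools as dp_tools

# Hypothetical salary data; bounds must be supplied so the library can
# clip values and calibrate noise without inspecting the raw data.
salaries = np.random.default_rng(1).uniform(30_000, 120_000, size=5_000)

private_mean = dp_tools.mean(salaries, epsilon=0.5, bounds=(30_000, 120_000))
print(f"Differentially private mean salary: {private_mean:,.0f}")
print(f"True mean salary (for comparison):  {salaries.mean():,.0f}")
```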