Professor Gautam Kamath and his colleagues Professor Florian Tramèr, a computer scientist at ETH Zürich, and Nicholas Carlini, a research scientist at Google DeepMind, have won a best paper award at ICML 2024, the 41st International Conference on Machine Learning, held from July 21 to 27, 2024 in Vienna, Austria.
The award was given to the research team for their paper, “Position: Considerations for Differentially Private Learning with Large-Scale Public Pretraining.” According to the conference’s program chairs, more than 9,400 papers were submitted, of which 2,610 were accepted for presentation. Of these, only ten received a best paper award at ICML, the leading international conference in machine learning.
“Congratulations to Gautam and his colleagues Florian and Nicholas,” said Raouf Boutaba, University Professor and Director of the Cheriton School of Computer Science. “Their position paper comes at a critical time and asks the research community to address the challenges at the intersection of machine learning and data privacy, advocating for the consideration of privacy preservation in models trained on large public datasets.”
Context
Machine learning models have made significant advances in learning generalizable concepts from large datasets. But these models can also memorize parts of their training data, which poses a threat when that data contains private information. Differential privacy provides a formal solution to this problem.
Training a machine learning model with differential privacy at the user level guarantees that the trained model does not depend heavily on any individual’s sensitive data, which in turn protects against the model memorizing its training data. However, current approaches to differentially private learning scale poorly, sacrificing the model’s utility to provably prevent memorization.
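The standard formalization behind this guarantee is (ε, δ)-differential privacy. A randomized training algorithm M satisfies it if, for any two datasets D and D′ that differ in one individual’s data, and for any set S of possible trained models,

$$\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta,$$

so no single person’s data can change the distribution over trained models by more than a small, controlled amount.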
To address this issue, many researchers have proposed augmenting differentially private learning algorithms with access to public data. The idea is to first learn generic features, ones independent of anyone’s private data, from large troves of non-privacy-sensitive data, and then to fine-tune the model efficiently with differential privacy applied to the sensitive data.
Suppose a company wishes to train a model on chat messages from its users. While the content of messages is private, the syntax, grammar and other features of the conversations are not. The company may wish to leverage a model that was pretrained on a large public body of chat conversations, then fine-tune it on the specific sensitive content of messages from its users.
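To make this paradigm concrete, here is a minimal sketch, not the authors’ implementation, of public pretraining followed by private fine-tuning in PyTorch with the Opacus library’s DP-SGD. The architecture, toy data, and hyperparameters (noise_multiplier, max_grad_norm, delta) are illustrative assumptions.

```python
# Minimal sketch of public pretraining + differentially private fine-tuning.
# Assumptions: PyTorch and Opacus are installed; the architecture, toy data,
# and hyperparameters below are placeholders, not the paper's setup.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Stand-in for a feature extractor pretrained on public data. Its weights
# are frozen; only the small task head is fine-tuned on sensitive data.
feature_dim, num_classes = 128, 2
public_backbone = nn.Sequential(nn.Linear(32, feature_dim), nn.ReLU())
for p in public_backbone.parameters():
    p.requires_grad = False  # "public" features stay fixed

private_head = nn.Linear(feature_dim, num_classes)
model = nn.Sequential(public_backbone, private_head)

# Toy stand-in for the sensitive fine-tuning data (e.g., user messages
# encoded as feature vectors with labels).
x = torch.randn(256, 32)
y = torch.randint(0, num_classes, (256,))
loader = DataLoader(TensorDataset(x, y), batch_size=32)

optimizer = torch.optim.SGD(private_head.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

# Opacus wraps the model, optimizer, and loader so that per-example
# gradients are clipped and noised during fine-tuning (DP-SGD).
privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.0,  # assumed value; trades privacy against utility
    max_grad_norm=1.0,     # per-example gradient clipping bound
)

for epoch in range(3):
    for xb, yb in loader:
        if len(xb) == 0:   # Poisson sampling can yield empty batches
            continue
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()

# Privacy budget spent on the sensitive data (delta chosen for illustration).
print(f"epsilon spent: {privacy_engine.get_epsilon(delta=1e-5):.2f}")
```

The point of the sketch is the division of labour described above: the backbone carries knowledge learned from public data, while the differential privacy accounting applies only to the gradient updates computed on the sensitive data, which is precisely the assumption the researchers’ paper scrutinizes.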
This pretraining and transfer learning approach has become the standard way to achieve state-of-the-art performance on various tasks in computer vision and natural language processing. A foundation model is first pretrained on massive, weakly curated data, typically scraped from the Internet. The model can then be efficiently fine-tuned on various downstream tasks.
The impressive performance of foundation models makes them natural candidates for private learning. Because the pretraining data comes from publicly available sources, the pretrained model is fully independent of the privacy-sensitive target data of individuals. And since these models learn new tasks from very few examples, they should also be able to learn those tasks privately with only a minor impact on performance.
Researchers’ position paper
In their position paper, the researchers challenge this view and critique the public-pretraining and private-fine-tuning paradigm. They raise concerns that models trained this way may be neither private nor useful, questioning whether current findings hold up in real-world deployments of differential privacy.
Their primary criticism challenges the idea that pretraining on publicly available Web data should be viewed as neutral from the perspective of user privacy. Pretraining data scraped en masse from the Web may be sensitive itself. Because a privacy-preserving fine-tuned model can still memorize its pretraining data, it can cause direct harm and dilute the very meaning of private learning.
Their paper also questions the semantics of calling a model private when the fine-tuning data is treated as sensitive but the pretraining data is assumed to be public. As they show, the latter assumption does not match the norms and expectations of what it means for a model to be private.
Beyond this core concern, they speculate that this paradigm might not be as useful as research suggests, and that it could even lead to a net loss of privacy during training or deployment, for two main reasons:
- Current private learning benchmarks may overestimate the value of public pretraining by fixating on settings with highly overlapping public and private data distributions.
- Public pretraining performs best with large models that cannot be run on user devices, requiring the outsourcing of private data to third parties, thus trading one form of privacy for another.
Their position paper takes a critical view of the state of the field, highlights several aspects the authors find problematic, and issues a call to the research community to address these challenges.
To learn more about the research on which this article is based, please see Florian Tramèr, Gautam Kamath, Nicholas Carlini. Position: Considerations for Differentially Private Learning with Large-Scale Public Pretraining. Proceedings of the 41st International Conference on Machine Learning, PMLR 235:48453–48467, 2024.