Xi He is an Assistant Professor at the Cheriton School of Computer Science and a member of its Data Systems Group. Her research interests span privacy and security for big-data management and analysis.
Xi and her colleague, Joseph Near, from the University of Vermont, recently published a book called Differential Privacy for Databases that focuses on techniques, algorithms and systems for answering database-style queries with differential privacy. Differential privacy is a promising approach to formalizing privacy—for writing down what privacy means as a mathematical equation.
What was the motivation to write this book?
Data privacy is extremely important for both computer science researchers and database practitioners. The book explores differential privacy, elegant mathematical formulations that can protect individuals’ privacy while still being able to gain useful insights from their data. Differential privacy was first proposed in 2006 and over the past 15 years many algorithms have been developed to satisfy such privacy.
But if you’re a novice, just starting in a job or if you’ve been tasked with implementing differential privacy into your organization’s database systems, you’d need to search through dozens and dozens of research papers, read them and figure out which algorithm is best to deploy in a particular system or for a particular purpose.
We compiled the state-of-the-art algorithms and techniques for different settings and purposes from the research literature into one resource. Our hope was to make it easy for practitioners to learn about these privacy-preserving algorithms and implement the ones that are best.
Have you worked with Joseph Near, your coauthor, before?
Jo and I hadn’t collaborated before we wrote the book, but we are familiar with each other’s work on differential privacy and state-of-the-art database systems. It was in many ways a natural collaboration as we had different but complementary strengths.
We started on the book before the pandemic was declared in March 2020, but much of the work was done during the pandemic period.
Who’s the intended audience?
This book provides an overview of state-of-the-art techniques for differential privacy, and we wrote the book so a novice could quickly learn the essentials and algorithms. We’ve compiled and summarized the algorithms in the hope that practitioners will quickly gain an understanding of what they do, so they can implement them. The book also presents practitioners and researchers who are more familiar with the subject an accessible text that covers the latest research.
We focused on techniques to answer database-style queries, on useful algorithms and their applications, and on systems and tools that implement them. We also include discussions on whether a specific algorithm is suitable in a particular setting or purpose—for example, is the algorithm good for high-dimensional data or for a database of multiple relations or federated databases—so practitioners can determine which is best in their circumstance.
Why use differential privacy?
Differential privacy allows someone to learn properties of a population while hiding information specific to individuals. Such techniques allow us to learn useful information from sensitive data while protecting the privacy of the individuals who contributed it.
An obvious example is determining the effectiveness of a new drug in a clinical trial. Medical researchers need to know if the drug is effective and if it causes side effects, but access to individual records, which could identify the people who participated in the study, is not necessary.
Differentially private algorithms intentionally reveal some information, but the goal is to control what can be learned from that information. Differential privacy is a way of formalizing privacy—writing down what privacy means as a mathematical equation.
It acts as a bridge between societal notions of privacy and the mathematical properties of privacy-preserving algorithms. We can prove that a specific algorithm satisfies differential privacy, and then determine whether that definition is a good approximation of society’s notions of privacy. Differential privacy has been successful because it serves particularly well in this role and is the best mathematical model of privacy we know of.
Why use differential privacy in database systems?
Information systems collect and process vast amounts of data, and most of it flows into databases. These database systems are specifically designed to collect, store and query data, and they have been optimized for that task. But if we would like to enable an analysis of sensitive data with differential privacy, we need to develop techniques that work for database systems because that’s where the private data is.
Integrating differentially private techniques with database systems presents significant challenges. A main goal of most database systems is to abstract away execution details, so that analysts can focus on the semantics of the queries they write instead of worrying about how they will be executed. But satisfying differential privacy requires careful control over the details of how a query is executed, which sometimes breaks this abstraction.
The techniques covered in the book represent significant progress towards building differentially private database systems. The approaches described have already resulted in useful, deployable systems, and we hope they will pave the way towards increasing the adoption of differential privacy in practice.