Xi He joined the David R. Cheriton School of Computer Science as an assistant professor in March 2019. She received her BS in computer science and applied mathematics from the University of Singapore in 2012 and her PhD in computer science from Duke University in 2018. Her research is on privacy and security for big-data management and analysis.
How are you finding life in Canada generally and Waterloo Region specifically?
This is my first time in Canada. I am enjoying it here, but it sure is cold. The people are very nice and welcoming, and everything is nearby — the campus, grocery stores, a park and forested area.
Tell us a little bit about yourself.
“For your generation to live in a better world, there is so much more our generation can do.” I was really touched by this line in Mark Zuckerberg and Priscilla Chan’s letter to their daughter.
Everyone has a unique way to contribute to a better future, and my choice is to be a researcher in and a teacher of computer science. Along my journey into computer science, I have been very fortunate to meet many great mentors and teachers. Their passion and excellence in mentoring and teaching have encouraged me to continue this journey, and to dream of becoming like them. Academia is a unique environment that combines research and teaching in a way not possible in any other place. I would like to work in such an environment, serving as a researcher who drives technology innovation for a better world and as a teacher who can instill a love of learning and inspire hope in the future generation of computer scientists and engineers.
Personally, I enjoy swimming, jogging and running. People might be surprised to learn that I used to practise martial arts. I practised Kung Fu and performed with a sword at contests and competitions, but I haven’t done that in a while.
When did you become interested in computer science?
I studied mathematics and computer science during my undergraduate degree. I thought I’d specialize in applied mathematics, but I took a few computer science courses along the way that changed my focus.
A good teacher can play an inspirational role in a student’s life. The professor who taught an introductory course on programming languages sparked my interest in computer science and that’s the main reason I continued to study it. I also took a class on databases. We built a website that displayed information, which got me interested in data exploration — how to present information in a way that people can understand it quickly and easily. I also worked on an artificial intelligence project with other students using a robot that could play games. The AI project had a lot of interesting hands-on components that got me even more interested in computer science.
What attracted you to the Cheriton School of Computer Science?
The Cheriton School has strong research groups working on both data systems and cryptography, security and privacy. My research is cross-disciplinary and draws from both of those areas. Few schools of computer science have such strong groups in these areas and the potential for synergistic research.
The school also has a supportive and collegial environment and everyone has been so welcoming and friendly. I foresee many possible research opportunities with faculty members and research groups here.
Tell us about your research on privacy and security for big-data management and analysis.
Privacy and security for data management is an important topic not just in computer science, but in our daily lives. Everyone says that privacy is important, but in practice people may discard their privacy in exchange for a service.
Think about your health data. It’s being collected by your mobile phone or smart watch using a variety of health applications. Is that data being used only to provide you with a report or suggestions? Or is it being sold to insurance companies? If it’s being sold it might affect the insurance premiums you pay or your insurability — whether you’re covered or declined. People may not think that far ahead. Are they getting fair treatment when they share their data?
There are lots of questions about how people’s data should be used, analyzed and managed. That’s what got me interested in data privacy. Few practical systems deal with data privacy, and the scale of data is so large now that data privacy has become a critically important concept.
You’ve been working on differential privacy. Can you tell us more about that?
Differential privacy can provide a strong personal guarantee of privacy when your data is being analyzed. It has already been deployed at several companies, including Google, Apple and Microsoft. It’s a promising technique that lets analysts work with data while providing a privacy guarantee. The techniques used by large companies need a group of privacy experts to develop solutions to process and share data. But small companies and start-ups may not have privacy experts. How can we provide a more general solution for everyone?
I’ve been working on something called CAPE or the cost-aware privacy engine. The cost here can be an accuracy cost or a performance cost: how much accuracy or performance you sacrifice to maintain data privacy. To protect a user’s privacy, I need to perturb the data, but in the process the data becomes less useful for analysis. For some critical applications, we may not want this to happen.
Data analysts can specify the accuracy or performance they need for their application. At the same time, we want to achieve the privacy guarantee users want. We then build a system that can support all of these requirements.
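To make the trade-off concrete, here is a minimal sketch of how an accuracy requirement could be translated into a privacy parameter. It assumes the standard Laplace mechanism on a counting query; the function names, the example count and the accuracy targets are hypothetical and are not taken from CAPE or APEx.

```python
import math
import random


def epsilon_for_accuracy(alpha: float, beta: float, sensitivity: float = 1.0) -> float:
    """Smallest epsilon for which Laplace noise stays within +/- alpha
    with probability at least 1 - beta."""
    # For Laplace noise with scale b, P(|noise| > alpha) = exp(-alpha / b).
    # Setting that equal to beta with b = sensitivity / epsilon gives:
    return sensitivity * math.log(1.0 / beta) / alpha


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    # A count changes by at most 1 when one person's record is added
    # or removed, so its sensitivity is 1.
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical requirements: error within 50 (loose) or within 5 (tight),
# each holding with probability at least 95%.
eps_loose = epsilon_for_accuracy(alpha=50.0, beta=0.05)
eps_tight = epsilon_for_accuracy(alpha=5.0, beta=0.05)

print(f"epsilon for +/-50: {eps_loose:.4f}")
print(f"epsilon for +/-5:  {eps_tight:.4f}")
print(f"noisy count at tight accuracy: {noisy_count(1000, eps_tight, rng):.1f}")
```

In this sketch, asking for ten times tighter accuracy costs ten times more epsilon, which means a weaker privacy guarantee. That is the kind of cost an analyst and a data owner would need to negotiate.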
We recently built a prototype called APEx, or Accuracy-aware Differentially Private Data Exploration, to support a class of queries for exploring tabular data. The paper describing it was recently accepted at SIGMOD 2019. We will extend this work to more complex data types, such as relational databases, graphs and IoT data, and to their respective applications.
On the other hand, dealing with real-world data and applications made us realize that differential privacy alone cannot meet all privacy requirements, so we proposed a novel class of privacy notions, known as Blowfish privacy, that generalizes differential privacy. Unlike differential privacy, which has a single privacy parameter to control the strength of privacy, Blowfish privacy has a rich set of tuning knobs, known as policy graphs, to customize the privacy guarantees. When privacy policies indicate that not all properties of an individual need to be kept secret, this allows better accuracy bounds than differential privacy and can make Blowfish privacy a more appealing guarantee in practice.
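As a rough illustration of why a policy graph can improve accuracy, the sketch below compares the sensitivity of a sum query under two policies. It assumes a simplified reading of a policy graph, in which an edge joins two values that an adversary must not be able to tell apart for any one individual. The age domain, the query and the function name are hypothetical.

```python
from itertools import combinations


def policy_sensitivity(weights, edges):
    """Largest change in the query sum(weights[v] for each person's value v)
    when one person's value moves along a single edge of the policy graph."""
    return max(abs(weights[u] - weights[v]) for u, v in edges)


# Hypothetical domain: ages 0..100. The query is the sum of all ages,
# so each person contributes a weight equal to their own age.
ages = list(range(101))
weights = {a: a for a in ages}

# Complete graph: every pair of ages must be indistinguishable, which
# corresponds to letting one record change to any other value.
complete_edges = list(combinations(ages, 2))

# Line graph: only adjacent ages must be indistinguishable.
line_edges = [(a, a + 1) for a in ages[:-1]]

full = policy_sensitivity(weights, complete_edges)
line = policy_sensitivity(weights, line_edges)

print(f"sensitivity under the complete graph: {full}")  # 100
print(f"sensitivity under the line graph:     {line}")  # 1
```

Because the noise needed to hide one person scales with sensitivity, the line-graph policy here would need far less noise for the same privacy parameter. The price is a weaker guarantee for ages that are far apart in the graph.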
Do you see opportunities for new, collaborative research, given the breadth and depth of research conducted at the school?
The Data Systems Group has many excellent researchers working on graph data, streaming data and traditional database design. There are many opportunities to collaborate within the group. And through undergraduate research assistantships we can tap into the talented pool of CS students here, who are well trained in both applications and theory.
I’ve also been attending the CrySP lab group meetings and see lots of opportunities for collaboration. The school encourages people to work together. Many of us are working on common problems, and we can tackle them more successfully if we each apply our techniques and expertise.
What do you consider your most significant contribution?
My most significant contribution is bringing the concept of customizable and provable privacy, including differential privacy and Blowfish privacy, to real-world applications and systems.
I’ve conducted demonstrations that let people visualize privacy loss and the loss of data utility needed to satisfy differential privacy. I have also run tutorials on differential privacy, so more people now understand what data privacy means and what the implications are when privacy is sacrificed. These efforts allow researchers from theoretical fields to understand the practical challenges and allow practitioners from industry to learn the latest privacy techniques, which encourages collaboration across fields and advances privacy-related research.
Defining data privacy is a challenging task that depends on the trust model of the application, the type of data and the use of data. To encode the privacy requirements of different applications, we proposed a class of useful privacy definitions called Blowfish privacy. This privacy framework allows the design of customized privacy guarantees and useful privacy-preserving mechanisms. In particular, this notion has inspired new privacy definitions for complex relationships such as Employer-Employee or EREE privacy used by the U.S. Census, for complex data types such as location data, and for distributed settings that involve secure computation between multiple parties. Blowfish privacy also generalizes differential privacy and provides a more flexible privacy-accuracy tradeoff space for real-world applications.
What work have you published recently?
Besides APEx, which was accepted at SIGMOD 2019, we recently published other work called Shrinkwrap: Differentially-Private Query Processing in Private Data Federations. This work considers a clinical data research network, or CDRN, which is a consortium of healthcare sites that agree to share their data for research. CDRN data providers wish to keep their data private while allowing external data analysts to query the union of their sensitive data. A private data federation is a set of autonomous databases that share a unified query interface offering in situ evaluation of SQL queries over the union of the sensitive data of its members. Existing private data federations do not scale well to complex SQL queries over large datasets.
Shrinkwrap is a query-processing engine that offers controlled information leakage with differential privacy guarantees to speed up private data federation query processing. Implementations of Shrinkwrap using the RAM model or the circuit model can achieve up to a 35 times performance improvement over the baseline. The system also provides tunable privacy parameters to control the tradeoff between privacy, accuracy and performance, and makes SQL query processing practical in the distributed setting. This work was accepted at VLDB 2019.