We sat down for a conversation with Gautam Kamath, an assistant professor at the Cheriton School of Computer Science in the Faculty of Mathematics and member of the Cybersecurity and Privacy Institute, about trust, privacy, and security surrounding machine learning (ML) and natural language processing (NLP) models.
Gautam leads a research group called The Salon and was recently named a Canada CIFAR AI Chair and a Vector Institute Faculty Member in recognition of his contributions to differential privacy, machine learning and statistics. His research focuses on developing trustworthy and reliable machine learning and statistics, with a particular emphasis on addressing fundamental problems in the realms of robustness and data privacy.
The following answers have been edited for clarity and brevity.
There’s a quote from Ernest Hemingway: “The best way to find out if you can trust somebody is to trust them.” Is that an approach we should take to the NLP- and AI-driven tools we interact with?
I thought a lot about what this phrase means. My personal interpretation is to trust them a little, and then see if you can trust them more. I interpret that quote as a type of test. You must test someone to see if you can trust them or not, and this is often done in machine-learning contexts for security and privacy. I also interpret it to mean, can we trust them to give us something correct or not? While these models are powerful and can do a lot of amazing things, when it comes to trusting them to give you the right answer 100% of the time, or to make decisions in a life-or-death situation, I don’t think they are quite there yet. These models can give wrong answers, confidently wrong answers.
How concerned should users of these AI and NLP tools be about their data privacy and security?
You start by asking where ChatGPT or any of these other machine learning and NLP tools gather their data from. Essentially, how do they learn? One of the things that powered a lot of advances in machine learning and NLP over the last 5–10 years is large, publicly available data sets. An example is Common Crawl, a data set scraped from the public Internet across a variety of different sources. You can imagine that a lot of this is going to be innocuous, perhaps random, Internet comments, jokes, and memes.
Now suppose I posted some sensitive information on my Facebook page and somehow I misunderstood the privacy settings, so it’s now accidentally visible to the world. It’s possible that information was used as training data. Down the line you don’t know what these tools will do with that information, and there have been cases demonstrating that these language models can spit out parts of their training data verbatim.
Are there any security and privacy concerns specifically about ChatGPT that you’ve come across?
It’s not exactly ChatGPT, but I have this paper on the closely related GPT-4 in front of me, and in the paper, they first comment on using publicly available data sets for training. The other thing I want to highlight is that the paper mentions data licensed from third-party providers. This is all they tell you about their datasets, which is kind of mysterious. What do these third-party providers have about me? I’m sure it’s an appropriately licensed data set, but you can imagine at some point you clicked ‘OK’ and accepted the terms of a license agreement on an app. Now your data might be in the hands of a third party, and unless specifically stated otherwise in the agreement, they can do whatever they want with it. They can sell it or license it to other people and companies. Now this third-party data, your data, is in this massive machine learning model.
Additionally, you’re also sending them data through the prompts you give ChatGPT. They state they will use this data to improve ChatGPT, so it essentially becomes new training data. Sensitive things you have told it or asked it, ChatGPT can memorize and use. People should think about the privacy considerations in all these cases. Unfortunately, I think people have already leaked a lot of their private information just by clicking ‘accept’ on things without understanding or thinking about where their data is going to end up.
Are there any real incentives for creators of machine learning and NLP models to be more careful with user data and privacy? Or is it viewed as almost a hindrance to progress?
One reason why you might not want to be careless with user data is to maintain their trust, so they provide you with more of their data in the future. A lot of my work is on a specific notion of privacy called differential privacy and a big complaint against this notion is that while it does guarantee more individual privacy in some very precise sense, it can hurt utility. On the other hand, maybe there is an order of magnitude more data that you wouldn't be able to access unless you put privacy and security first. So, you can enhance your model’s utility by enhancing your commitment to user privacy. If you are respectful of the users’ data they might give you more data later, which can allow your model to eventually do more useful things.
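To make the privacy–utility tradeoff Kamath describes concrete, here is a minimal sketch (not his method, just a textbook illustration) of the Laplace mechanism, one standard way to achieve differential privacy. The function name and parameters are ours for illustration: a smaller privacy parameter epsilon means stronger privacy but noisier, less useful answers.

```python
import math
import random

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Return a differentially private estimate of true_value.

    Adds Laplace noise with scale = sensitivity / epsilon. A smaller
    epsilon gives stronger privacy but a noisier answer -- the
    privacy/utility tradeoff discussed above.
    """
    scale = sensitivity / epsilon
    # Sample Laplace(0, scale) via inverse-transform sampling.
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    noise = -scale * sign * math.log(1 - 2 * abs(u))
    return true_value + noise

# Hypothetical counting query: how many users have some attribute.
# Adding or removing one person changes the count by at most 1,
# so the sensitivity is 1.
true_count = 100.0
private_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
```

With epsilon = 0.5 the noise has scale 2, so the released count is typically within a few units of the truth; with epsilon = 0.01 the answer would be far noisier but far more private.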
We keep offering up our personal information, is there any real digital security and privacy anymore? Do we just accept a lack of security and privacy as the norm?
That's a good but tough question. On one hand, there needs to be better education about the ‘bad things’ that can happen when you allow access to your data. I'm not sure if people think about these things because the outcomes of these decisions are distant. It’s difficult for people to have foresight and understand the potential risk. There needs to be better education that explicitly says ‘hey, you did this and now I can figure out this about you’. For example, there was a study that showed it’s possible to guess your sexuality from what pages you like on Facebook, even though they’re not obviously related.
I think the security and privacy community needs to focus more on accessible and understandable information for the public. Researchers in these areas understand these risks but come across as too technical when communicating those same risks. There is often no “smoking gun” for the average user, so I think making the risks clear is something that could be done to better educate everyone.