Pengyu Nie obtained his PhD in 2023 and MSc in 2020 from The University of Texas at Austin, where he was advised by Milos Gligoric. He received his BSc from the University of Science and Technology of China in 2017.
Pengyu’s research lies at the intersection of machine learning, natural language processing, and software engineering, with a focus on improving the productivity of developers during software development, testing, and maintenance. Specific topics include combining machine learning and code execution for test completion and lemma naming, learning to evolve code and comments, and frameworks for maintaining executable comments and specifications.
He has published 20 papers to date, many of which are in top-tier software engineering, natural language processing, and programming language conferences. He is the recipient of two ACM SIGSOFT Distinguished Paper Awards, one at the 2023 ACM SIGSOFT International Symposium on Software Testing and Analysis and another at the 2019 ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. He is also a recipient of The University of Texas at Austin Graduate School Fellowship.
What follows is a lightly edited transcript of a Q&A interview.
When did you become interested in machine learning and natural language processing for software engineering?
I started working in this field at the beginning of my PhD. It was one of the research directions my advisor wanted to explore. I was also interested in it because I wanted to use machine learning and natural language processing models to help software developers write more comprehensible and elegant code.
At the time it was a very new field. My first project was to automate the maintenance of to-do comments. The idea is that developers often write comments like “TODO: fix this if some condition is met,” but they may forget and never get back to it, especially if the comment is buried deep in the code. To tackle this problem, we built a framework called TrigIt that allows developers to write these kinds of to-do comments in an executable format, meaning that they will be automatically triggered when the condition is met. It was a fun project, and a good example of how machine learning or natural language processing can be integrated into the software development workflow to improve the quality of the code. I’m quite proud of this work. It won an ACM SIGSOFT Distinguished Paper Award.
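To make the idea concrete, here is a minimal sketch of what an executable to-do comment could look like. This is an illustrative analogue in Python, not TrigIt's actual syntax (TrigIt targets Java and uses an annotation-based format); the `todo` helper and its behavior are assumptions for illustration only.

```python
# Hypothetical sketch of an executable to-do comment: instead of a plain
# "TODO: fix this when X" comment that may be forgotten, the to-do is
# written as code that fires automatically once its trigger condition
# becomes true. This is NOT TrigIt's real syntax, just an analogue.

import warnings


def todo(condition: bool, message: str) -> bool:
    """Warn the developer when the to-do's trigger condition is met."""
    if condition:
        warnings.warn(f"TODO triggered: {message}", stacklevel=2)
        return True
    return False


# Example: "TODO: remove this workaround once the API version reaches 2"
api_version = 2
triggered = todo(api_version >= 2, "remove workaround for API v1")
```

Because the trigger is evaluated as part of the program, the to-do cannot silently go stale: the moment the condition holds, the developer is notified.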
Nowadays, we have much more powerful machine learning and natural language processing models, such as large language models. With these tools, I think research on machine learning and natural language processing for software engineering is more important than ever because we are shaping the future of how people develop and use software.
What attracted you to the Cheriton School of Computer Science?
Waterloo has one of the top computer science schools in the world. We have many strong researchers and students in both software engineering and machine learning. The first time I visited Waterloo was during my faculty interview, but it felt like being at home. I was welcomed warmly by so many colleagues. We spoke a common language — everyone here is passionate about research and collaboration. It was at that moment that I knew I wanted to be part of this community. This is the right place for me.
I was also attracted by the excellent students at Waterloo. During my faculty interview, many grad students attended my talk and asked good, challenging questions. I was impressed by their enthusiasm and intelligence. The undergrad students here are also strong, and some are very interested in research, something I learned after I joined and began working with them. I think our unique co-op-oriented undergrad program prepares them well for their future research and industry careers.
Tell us more about your research.
My research is mainly about improving the productivity of software developers, from software development to maintenance to testing. My research projects typically start by identifying a real-world problem that developers are facing, such as writing tests, which can be a tedious task. I then design techniques to solve the problem, with machine learning and natural language processing models or with program analysis — usually a combination of both. The story ends with deploying the techniques in the real world and asking: do they improve the software development workflow?
At the beginning of my PhD, I started by focusing on the problems of generating and maintaining comments in code. Later, as we had more powerful machine learning and natural language processing models available, I expanded to more challenging problems, such as generating software tests and proofs. These targets are more challenging because they involve significantly more complex reasoning about code and domain expertise.
At Waterloo, I am further expanding this line of research under the umbrella of software engineering plus machine learning and natural language processing. For example, I’m looking at the training and inference of those machine learning models that are already being used in software engineering.
Do you see opportunities for collaboration at the School of Computer Science?
Definitely. We have a large and very strong software engineering research group at Waterloo with a diverse set of expertise. I see a lot of collaborative opportunities in my group, and we plan on co-supervising some incoming students.
But I also see collaborative opportunities with other research groups. I am connecting with the programming languages group to explore machine learning for proofs. There are several assistant professors working on natural language processing who will be joining the school soon, and I am excited about working with them.
What do you see as your most significant contribution?
It’s more a series of papers as opposed to a single paper. My work has brought software engineering domain knowledge into the design of machine learning models for software engineering. The early machine learning models for software engineering originated within the NLP community, and were designed to process natural language text. Software, however, is very different from natural language in two important ways: one, software code follows strict grammar and is executable, and, two, software evolves over time during its development and maintenance lifecycle.
My most significant contribution is integrating these two pieces of domain knowledge into machine learning models and significantly improving their performance in challenging software engineering tasks such as code summarization and test generation. I believe that eventually we will need fundamentally different kinds of machine learning models for software engineering, with improved reasoning capabilities and integration with software engineering tools.
Examples of my previous research are TeCo, a deep learning model that uses code semantics for test completion, and CoditT5, a large language model for software-related editing tasks which is pretrained on large amounts of source code and natural language comments. This is something I am excited about and will continue to work on.
Who has inspired you most?
My PhD advisor, Milos Gligoric, is my role model. I learned a lot from him, including the skills to conduct impactful research, to run research projects, and to manage a research group. Importantly, he also taught me to balance research with life, even though a researcher's life largely revolves around research. He was very supportive of my career development. When I started my PhD, I had little idea of how the academic world works, and he showed me the way to success in this world. Now, as an assistant professor myself, I am following his path and trying to be a good mentor to my students.
What do you do in your spare time?
I sometimes play video games. I used to do that a lot more when I was a student, but now I have less time. Cooking is one of my hobbies, and I developed my culinary skills during the pandemic. I am also trying to pick up some instruments again, like the piano, which I learned when I was young.