Meet Hong Zhang, a professor who develops high-performance, scalable systems for big data and machine learning applications

Hong Zhang joined the Cheriton School of Computer Science as an Assistant Professor in 2023. He develops high-performance, scalable systems for big data and machine learning applications. His research advocates an application-oriented design principle for big data and machine learning systems that fully exploit application-specific structures such as communication patterns, execution dependencies, and machine learning model structures to suit application-specific performance demands.

This principle has led to the development of several scalable systems with theoretically sound scheduling algorithms that are tailored to different big data and machine learning applications.

The following is a lightly edited transcript of a Q and A interview.

Tell us a bit about yourself.

Before joining the Cheriton School of Computer Science, I was a postdoc in the RISELab at the University of California, Berkeley. Prior to that, I completed my PhD in the iSING Lab at Hong Kong University of Science and Technology. I am from Wuhu, a city sitting on the southeast bank of the Yangtze River.

When did you become interested in high-performance, scalable systems?

My master’s degree is in electronics and communications engineering. Back then, the lab I was in focused mostly on wireless sensor networks. One day I came across a fascinating paper titled “Above the Clouds: A Berkeley View of Cloud Computing.” This paper planted a seed: that we can view computing as a utility, much the way we view the provision of water and electricity as utilities. I was also impressed by how big data analytics can change our lives by harnessing huge amounts of data and compute power. After reading more papers on the topic, I decided to pursue a PhD related to big data, distributed systems, and cloud computing.

After my PhD, I went to UC Berkeley to be a postdoctoral researcher in the very lab that produced the Berkeley View of Cloud Computing paper. It was interesting to come full circle and be part of the research group that sparked my interest in scalable computing and attracted me to this research area in the first place.

What attracted you to the Cheriton School of Computer Science?

During my PhD I worked closely with several amazing researchers, including Mosharaf Chowdhury, a computer science professor at the University of Michigan. It turns out that Mosharaf did his master’s degree at Waterloo under the supervision of Raouf Boutaba, the Director of the Cheriton School of Computer Science. I learned a lot about the School of Computer Science from Mosharaf. I also met many strong undergrad and graduate students at Berkeley during my postdoctoral fellowship, students who had completed their computer science degrees at Waterloo. I kept seeing and being impressed by these Waterloo computer science graduates.

During my faculty interview, I also saw that the School of Computer Science provides a supportive and collegial environment. The professors I met were super friendly and helpful. It was clear Waterloo is a great place to be with many opportunities to collaborate with both strong students and exceptional colleagues. I felt this was the right place for me to start my academic career and to build my research group.

Tell us a bit about your research.

I work on systems and networking in general. At a high level, my research focuses on developing high-performance and scalable distributed systems for big data and machine learning applications.

Big data and machine learning applications are increasingly used to harvest massive amounts of data, and as a result they require large-scale, high-performance systems to transfer, store, and process data at these scales. In response to this critical need, my research focuses on designing systems that efficiently leverage the underlying infrastructure resources (compute, memory, and network) to optimize the performance of these applications.

To go a little deeper into this, big data and machine learning applications have application-specific internal structures, including different communication patterns, execution dependencies, and machine learning model structures. Moreover, applications have various resource and performance demands, exposing different trade-offs between cost, latency, and throughput. My research aims to address two key questions: First, what are the critical application structures and demands that can potentially impact system design? Second, how can we build efficient systems based on these application-specific features?

Do you see opportunities for collaborative research?

Yes, very much so. As a systems researcher I obviously see the potential for collaboration with members of the Systems and Networking Group and with the Data Systems Group. Moreover, I’m excited to work with the researchers in the Machine Learning group to build more efficient machine learning systems. With a deeper understanding of machine learning workloads, we can collaborate to create more tailored system optimizations.

I’m also interested in the intersection between systems and computer security. There are many interesting research problems regarding system design for federated learning and homomorphic encryption. In this area, I’m hoping to collaborate with members of the Cryptography, Security and Privacy group.

What do you see as your most important contribution?

My most important contribution isn’t a single paper. Rather, it’s more about identifying fundamental challenges and opportunities in systems design that are brought about by new workloads, new compute paradigms, and new techniques.

For instance, CODA, a project I led during my PhD, was among the earliest attempts to optimize big data analytics using machine learning techniques. An interesting contribution of the project was its examination of how machine learning errors can affect system performance, along with its proposal of error-tolerance mechanisms to minimize their impact. When serverless computing emerged a few years ago, I proposed Caerus, which demonstrated how the serverless computing paradigm could transform the way users schedule data analytics workloads. As model serving and training are now becoming the most crucial workloads, I proposed SHEPHERD, a project that identified the key features of model serving workloads and showed how they can affect the design of model serving systems.

Who has inspired you most?

My greatest inspirations in my academic journey have been my PhD supervisor, Kai Chen, and my postdoc supervisor, Ion Stoica. They are at different career stages and have distinct supervision styles. I have been fortunate enough to learn from both of them.

As one of Kai’s first graduate students, I collaborated closely with him, and his mentorship helped shape my research interests while emphasizing simplicity in design and clarity in presentation. Also, watching him build a successful research group from scratch has been invaluable, as I now navigate similar challenges as an assistant professor.

My postdoc advisor, Ion Stoica, is one of the most established and successful researchers in the systems area. He provided me with a unique platform where I could work with and learn from an exceptional group of researchers with diverse styles. He’s a great role model from whom I learned a lot, from identifying fundamental problems in systems design to conducting research with great real-world impact.

What do you do in your spare time?

I enjoy hiking and I like to explore new places on foot. I also enjoy watching movies with novel settings and interesting plots. In the past I spent a considerable amount of time playing a diverse range of strategy PC games. I feel that in some ways the problems I solve in these games, such as resource management, pipelining, event handling, and prioritization, are similar to the ones I encounter in my research.