PhD Defence • Systems and Networking • Efficiently Training Deep Learning Models on Elastic and Heterogeneous Cloud Resources

Friday, April 10, 2026 2:00 pm - 5:00 pm EDT (GMT -04:00)

Please note: This PhD defence will take place online.

Runsheng (Benson) Guo, PhD candidate
David R. Cheriton School of Computer Science

Supervisor: Professor Khuzaima Daudjee

Deep Neural Networks (DNNs) have demonstrated remarkable success across diverse domains, but training them requires substantial computational resources and is typically parallelized across large GPU clusters. Because such clusters are prohibitively expensive for most organizations to own and manage, organizations often rent them on cloud platforms to meet their training needs. While cloud environments offer elastic scalability and heterogeneous hardware options, they also introduce significant challenges for efficient distributed DNN training. Existing training frameworks lack support for dynamic reconfiguration during training, limiting how much of the cloud's elasticity can be exploited. Most systems also assume homogeneous clusters, an assumption that rarely holds because hardware availability constraints often leave organizations with heterogeneous GPU clusters. Finally, heterogeneous network conditions in cloud environments create communication bottlenecks that limit the scalability of existing approaches.

This thesis presents three systems that collectively address these limitations to enable efficient distributed DNN training on elastic and heterogeneous cloud resources. First, Hydrozoa leverages cloud elasticity through serverless containers, enabling dynamic scaling and configuration changes during training while avoiding the traditional pitfalls of serverless computing. By combining data and model parallelism with fine-grained resource provisioning, Hydrozoa achieves cost-effective training while eliminating cluster management overhead. Second, Cephalo targets heterogeneous GPU clusters by balancing compute and memory resources independently across GPUs with differing capabilities. Unlike existing approaches that tie workload assignment solely to computational speed, Cephalo optimizes compute distribution through proportional batch sizing while separately managing memory through partitioning of training state, activation checkpointing, and gradient accumulation. Third, Zorse tackles heterogeneous network conditions, which are particularly common in heterogeneous clusters, by combining memory-efficient data parallelism with pipeline parallelism. Through interleaved pipelining, parameter and activation offloading, and heterogeneous pipeline parallelism configurations, Zorse achieves both communication and memory efficiency when training large DNN models across diverse network topologies.
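
To illustrate the compute-balancing idea behind Cephalo, the sketch below splits a global batch across GPUs in proportion to each device's measured per-second throughput, so that faster GPUs process proportionally more samples and no device idles at synchronization points. This is a minimal illustrative sketch only: the function name, the GPU labels, and the throughput figures are hypothetical assumptions, not Cephalo's actual API or profiling data.

```python
# Illustrative sketch: assign per-GPU batch sizes proportional to measured
# throughput, the core idea behind compute balancing on heterogeneous GPUs.
# Names and numbers below are hypothetical, not Cephalo's implementation.

def proportional_batch_sizes(global_batch_size, samples_per_sec):
    """Split a global batch across GPUs in proportion to their throughput."""
    total = sum(samples_per_sec.values())
    # Provisional fractional share of the global batch for each GPU.
    shares = {gpu: global_batch_size * rate / total
              for gpu, rate in samples_per_sec.items()}
    # Round down, then hand the leftover samples to the GPUs with the
    # largest fractional remainders so the sizes sum to the global batch.
    sizes = {gpu: int(share) for gpu, share in shares.items()}
    leftover = global_batch_size - sum(sizes.values())
    for gpu in sorted(shares, key=lambda g: shares[g] - sizes[g],
                      reverse=True)[:leftover]:
        sizes[gpu] += 1
    return sizes

# Hypothetical profiled throughputs (samples/second) for a mixed cluster.
throughput = {"A100": 420.0, "V100": 210.0, "T4": 70.0}
print(proportional_batch_sizes(global_batch_size=512,
                               samples_per_sec=throughput))
# {'A100': 307, 'V100': 154, 'T4': 51} -- each GPU finishes its share in
# roughly the same time, so the slowest device no longer gates the step.
```

Under this kind of proportional assignment, per-step time is governed by the ratio of assigned batch size to device speed rather than by the slowest GPU alone; Cephalo's contribution is to decouple this compute balancing from memory management, which it handles separately via training-state partitioning, activation checkpointing, and gradient accumulation.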

The experimental evaluation demonstrates that these systems significantly improve training efficiency and resource utilization compared to existing approaches. Hydrozoa reduces training costs while providing seamless scalability, Cephalo simultaneously achieves high compute and memory utilization in heterogeneous clusters, and Zorse maintains high throughput under varying network conditions. Together, these contributions make distributed DNN training more accessible, cost-effective, and efficient in modern cloud environments, advancing the state of the art in large-scale machine learning infrastructure.

Attend this PhD defence virtually on Zoom.