Please note: This PhD seminar will take place in DC 1304.
Runsheng (Benson) Guo, PhD candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Khuzaima Daudjee
Training transformer models requires substantial GPU compute and memory resources. As high-end GPUs are costly and limited in availability, heterogeneous clusters with diverse GPU types are becoming more common. Existing methods attempt to balance compute load across heterogeneous GPU clusters, but memory constraints often leave that compute underutilized: because a GPU's assigned share of the work also determines its memory footprint, GPUs with less memory cap the batch sizes the cluster can run.
In this talk, I will present Cephalo, a system that optimizes compute and memory usage in heterogeneous clusters by decoupling compute distribution from training state assignment. Cephalo outperforms state-of-the-art methods, achieving significantly higher training throughput while supporting larger models and batch sizes.
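To make the decoupling idea concrete, here is a minimal sketch; it is not Cephalo's actual algorithm, and the GPU specs, function names, and proportional heuristics are illustrative assumptions. The point is that batch shards can be sized by compute throughput while the sharded training state is placed by memory capacity, so neither assignment constrains the other.

```python
# Minimal sketch of decoupled assignment -- NOT Cephalo's actual algorithm.
# GPU specs, names, and the proportional heuristics are assumptions.
from dataclasses import dataclass

@dataclass
class GPU:
    name: str
    tflops: float   # relative compute throughput
    mem_gb: float   # memory capacity

def split_batch_by_compute(gpus: list[GPU], global_batch: int) -> dict[str, int]:
    """Assign each GPU a batch shard proportional to its compute throughput."""
    total = sum(g.tflops for g in gpus)
    return {g.name: round(global_batch * g.tflops / total) for g in gpus}

def split_state_by_memory(gpus: list[GPU], state_gb: float) -> dict[str, float]:
    """Independently shard training state (params, grads, optimizer state)
    proportional to each GPU's memory capacity."""
    total = sum(g.mem_gb for g in gpus)
    return {g.name: state_gb * g.mem_gb / total for g in gpus}

cluster = [GPU("A100", tflops=312, mem_gb=80),
           GPU("T4",   tflops=65,  mem_gb=16)]

print(split_batch_by_compute(cluster, global_batch=64))  # sized by compute
print(split_state_by_memory(cluster, state_gb=48))       # placed by memory
```

Under a coupled scheme, the T4's 16 GB would force both its batch shard and its state shard to shrink together; decoupling lets the faster GPU absorb more state and work independently of how the batch is split.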