PhD Seminar • Systems and Networking • Cephalo: Harnessing Heterogeneous GPU Clusters for Training Transformer Models

Friday, November 29, 2024 1:00 pm - 2:00 pm EST (GMT -05:00)

Please note: This PhD seminar will take place in DC 1304.

Runsheng (Benson) Guo, PhD candidate
David R. Cheriton School of Computer Science

Supervisor: Professor Khuzaima Daudjee

Training transformer models requires substantial GPU compute and memory resources. As high-end GPUs are costly and in limited supply, heterogeneous clusters with diverse GPU types are becoming more common. Existing methods attempt to balance compute across heterogeneous GPU clusters, but memory constraints often leave that compute underutilized.

In this talk, I will present Cephalo, a system that optimizes compute and memory usage in heterogeneous clusters by decoupling compute distribution from training state assignment. Cephalo outperforms state-of-the-art methods, achieving significantly higher training throughput while supporting larger models and batch sizes.
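To make the decoupling idea concrete, here is a minimal sketch (not Cephalo's actual implementation): batch work is split in proportion to each GPU's compute throughput, while training state (e.g., optimizer state) is sharded in proportion to each GPU's memory, independently of the compute split. The GPU specs and function names below are hypothetical illustrations.

```python
# Hypothetical sketch of decoupled assignment in a heterogeneous cluster.
# Compute split and state sharding use different per-GPU attributes,
# so a memory-poor but fast GPU can still take a large compute share.

def split_batch(gpus, global_batch):
    """Assign per-GPU batch shares proportional to compute throughput."""
    total_flops = sum(g["tflops"] for g in gpus)
    shares = [round(global_batch * g["tflops"] / total_flops) for g in gpus]
    shares[-1] += global_batch - sum(shares)  # absorb rounding drift
    return shares

def shard_state(gpus, state_gb):
    """Shard training state proportional to memory capacity,
    independently of the compute split above."""
    total_mem = sum(g["mem_gb"] for g in gpus)
    return [state_gb * g["mem_gb"] / total_mem for g in gpus]

# Example cluster with mixed GPU types (illustrative specs).
gpus = [
    {"name": "A100", "tflops": 312, "mem_gb": 80},
    {"name": "V100", "tflops": 125, "mem_gb": 32},
    {"name": "T4",   "tflops": 65,  "mem_gb": 16},
]

batch_shares = split_batch(gpus, global_batch=512)
state_shards = shard_state(gpus, state_gb=48.0)
```

A coupled scheme would force both splits to follow the same ratio; decoupling them lets each resource be balanced on its own axis, which is the intuition behind avoiding memory-induced compute underutilization.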