Please note: This PhD seminar will be given online.
Runsheng (Benson) Guo, PhD candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Khuzaima Daudjee
Deep neural networks (DNNs) are often trained in parallel on a cluster of virtual machines (VMs) to reduce training time. However, this requires explicit cluster management, which is cumbersome and often results in costly over-provisioning of resources. Training DNNs on serverless compute is an attractive alternative that is receiving growing interest. In a serverless environment, cluster management is handled for the user, compute resources can be scaled at a fine-grained level, and users are billed only for the resources they consume. Despite these advantages, existing serverless systems for DNN training are ineffective because they are limited to CPU-based training and bottlenecked by expensive distributed communication.
I will present Hydrozoa, a serverless system we have developed for distributed DNN training. Hydrozoa overcomes the limitations of existing serverless DNN training systems with a novel architecture that combines serverless containers with hybrid-parallel training. It also supports dynamic worker scaling, which helps improve statistical training efficiency. Hydrozoa achieves significant throughput-per-dollar improvements over existing VM-based and serverless training approaches while relieving the user of the burden of managing machine clusters.
Bio: Runsheng (Benson) Guo is a PhD candidate whose research interests are in distributed ML training and in serverless and cloud computing.