Seminar • Data Systems • Architecture-Tailored Parallelization for Accessible Large Model Era | Cheriton School of Computer Science

Thursday, April 4, 2024 10:30 am - 11:30 am EDT (GMT -04:00)

Please note: This seminar will take place in DC 1304.

Xupeng Miao, Postdoctoral Researcher
Computer Science Department, Carnegie Mellon University

In this talk, I will introduce my work on machine learning (ML) parallelization, a critical endeavor to bridge the significant gap between diverse ML programs and multitiered computing architectures. Specifically, I will explore ML parallelization at three distinct yet interconnected levels.

First, I will show that exploring previously unexplored model partitioning strategies improves communication efficiency and makes distributed ML training up to 20x faster than existing systems. I will highlight some innovative distributed ML systems, such as HET for sparse embedding models and Galvatron for dense Transformer models. Second, I will discuss how to improve GPU utilization through ML parallelization. I will present SpecInfer, a system that reduces large language model (LLM) serving latency by 1.5-3.5x compared to existing systems by leveraging a novel tree-based speculative inference and verification mechanism. Third, I will demonstrate how ML parallelization makes LLMs more accessible by extending its boundaries to inter-cloud environments. I will describe SpotServe, the first LLM serving system on spot instances, which handles preemptions with dynamic reparallelization, ensures relatively low tail latency, and reduces monetary cost by 54%. Finally, I will conclude with a discussion on pushing my research forward toward a holistic and unified infrastructure for democratizing ML.

Bio: Xupeng Miao is currently a postdoctoral researcher at Carnegie Mellon University, working with Prof. Zhihao Jia and Prof. Tianqi Chen. Before that, he received his Ph.D. from Peking University, advised by Prof. Bin Cui.

He is broadly interested in machine learning systems, data management, and distributed computing. His research has resulted in 30+ publications (including 13 first-authored papers) in top-tier conferences such as OSDI, ASPLOS, SIGMOD, VLDB, NSDI, and NeurIPS. Recently, he has focused on building efficient, scalable, and affordable software systems (e.g., FlexFlow Serve) for large language models. His work has been recognized with the 2022 ACM China Doctoral Dissertation Award, the Best Scalable Data Science Paper Award of VLDB 2022, and the Distinguished Artifact Award of ASPLOS 2024.

Location Information

Location Address: DC - William G. Davis Computer Research Centre
200 University Avenue West
DC 1304
Waterloo, ON, CA N2L 3G1
