Speaker: Oana Balmau, McGill University
Location: DC 1302
Abstract: Data is the driving force behind machine learning (ML) algorithms. The way we ingest, store, and serve data can impact end-to-end training and inference performance significantly. For instance, as much as 50% of the power can go into storage and data cleaning in large production settings. The amount of data that we produce is growing exponentially, making it expensive and difficult to keep entire training datasets in main memory. Increasingly, ML algorithms will need to access data directly from persistent storage in an efficient manner. To address this challenge, this work sets out to characterize I/O patterns in ML, with a focus on data pre-processing and training.
We use trace collection to understand the impact of storage on ML workloads. The key factors we are investigating include the workload type, the software framework (e.g., PyTorch, TensorFlow), the accelerator type (e.g., GPU, TPU), the dataset-size-to-memory ratio, and the degree of parallelism. Traces are collected mainly through eBPF, along with other system monitoring tools such as mpstat and NVIDIA Nsight. Our traces include VFS-layer calls such as read, write, open, and create, as well as mmap calls, block I/O accesses, CPU use, memory use, and accelerator use. Based on the trace analysis, we plan to build a synthetic I/O workload generator. The workload generator will accurately reproduce I/O patterns for representative ML workloads while simulating the computation time.
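To make the idea of a synthetic I/O workload generator concrete, here is a minimal Python sketch. It is not the speaker's tool: the function names (make_dataset, replay_epoch), the fixed-size-sample file layout, and the shuffled one-pass-per-epoch access pattern are all assumptions chosen for illustration. It issues the reads a training epoch might issue and replaces the per-batch computation with a sleep.

```python
import os
import random
import tempfile
import time


def make_dataset(num_samples, sample_size):
    """Create a temporary file of fixed-size random samples (hypothetical layout)."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(os.urandom(num_samples * sample_size))
        return f.name


def replay_epoch(path, sample_size, batch_size, compute_s, seed=0):
    """Read every sample once in shuffled order, sleeping after each batch.

    The sleep stands in for the accelerator computation, so only the
    I/O pattern is real. Returns (bytes_read, batches).
    """
    num_samples = os.path.getsize(path) // sample_size
    order = list(range(num_samples))
    random.Random(seed).shuffle(order)  # assumed access pattern: random shuffle

    bytes_read = 0
    batches = 0
    with open(path, "rb") as f:
        for start in range(0, num_samples, batch_size):
            for idx in order[start:start + batch_size]:
                f.seek(idx * sample_size)
                bytes_read += len(f.read(sample_size))
            time.sleep(compute_s)  # simulated computation time for this batch
            batches += 1
    return bytes_read, batches


if __name__ == "__main__":
    dataset = make_dataset(num_samples=64, sample_size=4096)
    try:
        print(replay_epoch(dataset, sample_size=4096, batch_size=8, compute_s=0.01))
    finally:
        os.remove(dataset)
```

In a trace-driven generator, the shuffled order, sample size, and sleep duration would presumably be derived from collected traces rather than fixed by hand as they are here.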
Bio: Oana Balmau is an Assistant Professor in the School of Computer Science at McGill University, where she leads the DISCS Lab. Her research focuses on storage systems and data management systems, in particular for workloads in machine learning, data science, and edge computing. Oana completed her PhD in Computer Science at the University of Sydney and earned her Bachelor's and Master's degrees from EPFL, Switzerland. Oana's doctoral dissertation won the CORE John Makepeace Bennett Award 2021 for the best computer science dissertation in Australia and New Zealand, as well as an Honorable Mention for the ACM SIGOPS Dennis M. Ritchie Doctoral Dissertation Award. Oana is also part of MLCommons, where she leads the effort on storage benchmarking for machine learning.