Speaker: Spyros Blanas, The Ohio State University
Abstract: We are witnessing the increasing use of warehouse-scale computers to analyze massive datasets quickly. This poses two challenges for database systems. The first challenge is interoperability with established analytics libraries and tools. Massive datasets often consist of small images (arrays) in file formats like HDF5 and FITS. We will first present ArrayBridge, an open-source I/O library that allows processing with SQL, SciDB or TensorFlow without converting between file formats. ArrayBridge can transparently optimize data placement to make I/O more than 300X faster than directly reading small files. The second challenge is scalability, as warehouse-scale computers expose communication bottlenecks in foundational data processing operations. We will present a data shuffling algorithm that carefully uses RDMA to transmit data up to 4X faster than MPI. We will then present GRASP, an algorithm for high-cardinality parallel aggregation. By carefully scheduling data transfers to leverage similarity, GRASP avoids unscalable all-to-all communication and completes the aggregation more than 3X faster than repartitioning. We conclude by highlighting additional challenges that need to be overcome to scale database systems to massive computers.
Bio: Spyros Blanas is an assistant professor in the Department of Computer Science and Engineering at The Ohio State University. His research interest is high-performance database systems, and his current goal is to build a database system for high-end computing facilities. He has received the IEEE TCDE Rising Star award and a Google Research Faculty award. He completed his Ph.D. at the University of Wisconsin–Madison, where part of his Ph.D. dissertation was commercialized in Microsoft SQL Server as the Hekaton in-memory transaction processing engine.
200 University Avenue West
Waterloo, ON N2L 3G1
Canada