Amol Deshpande, Department of Computer Science, University of Maryland
Abstract: For several decades now, the amount of data available to us has been growing at a pace far higher than our ability to process it; this trend, popularly referred to as "big data", has accelerated many-fold in recent years with the emergence of efficient and mass-produced scientific instruments, increasing ease of generating and publishing data, and proliferation of Internet-connected devices. In this talk, I will present an overview of two recent projects from my group at UMD on building scalable platforms for large-scale data analytics.
First, I will discuss our ongoing work on building a platform, called "DataHub", for enabling collaborative data science, where teams of data scientists can simultaneously analyze, modify, and share datasets to understand trends and extract actionable insights. While numerous solutions exist for specific data analysis tasks, the underlying infrastructure and data management capabilities for supporting ad hoc collaboration pipelines are still largely missing. I will present our vision for a unified, dataset-centric platform for addressing these challenges, and describe our recent work on: (a) efficiently managing a large number of versioned datasets, (b) designing and supporting a unified query language to seamlessly query versioning and provenance information, and (c) managing the lifecycle of complex machine learning models like deep neural networks.
Second, I will present our initial work on extracting hidden graphs from relational databases. Although there has been much work on large-scale graph analytics, graphs are not the primary representation choice for most data today, and users who want to employ graph analytics are forced to extract data from their data stores, construct the requisite graphs, and then use a specialized engine to write and execute their graph analysis tasks. I will describe our work on a system called GraphGen that enables users to declaratively specify graph extraction tasks over relational databases, visually explore the extracted graphs, and write and execute graph algorithms over them, either directly or using existing graph libraries like the widely used NetworkX Python library.
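To make the manual workflow described above concrete, here is a minimal sketch of the extract, construct, and analyze steps a user performs by hand today. It is not GraphGen's interface or its extraction language. The `author_pub` table, its columns, and all function names are hypothetical, chosen only for illustration, and the sketch uses only the Python standard library.

```python
import sqlite3
from collections import defaultdict, deque


def build_demo_db():
    """Create a tiny in-memory table (hypothetical schema) linking authors to publications."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE author_pub (author TEXT, pub TEXT)")
    conn.executemany(
        "INSERT INTO author_pub VALUES (?, ?)",
        [("ann", "p1"), ("bob", "p1"), ("bob", "p2"), ("cat", "p2"), ("dan", "p3")],
    )
    conn.commit()
    return conn


def extract_coauthor_graph(conn):
    """Extract the graph hidden in the table: authors are nodes, and an
    edge joins two authors who share at least one publication."""
    graph = defaultdict(set)
    # Register every author so that authors with no co-authors still appear.
    for (author,) in conn.execute("SELECT DISTINCT author FROM author_pub"):
        graph[author]
    # A self-join on the publication column yields the co-author pairs.
    edges = conn.execute(
        """
        SELECT DISTINCT a.author, b.author
        FROM author_pub AS a
        JOIN author_pub AS b ON a.pub = b.pub
        WHERE a.author < b.author
        """
    )
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return dict(graph)


def connected_components(graph):
    """Run a simple graph algorithm (breadth-first search) over the extracted graph."""
    seen = set()
    components = []
    for start in sorted(graph):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in sorted(graph[node]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components


if __name__ == "__main__":
    conn = build_demo_db()
    graph = extract_coauthor_graph(conn)
    print(connected_components(graph))
```

In this sketch the self-join plays the role of the hand-written extraction step, and the resulting adjacency map could equally be handed to a graph library such as NetworkX instead of the hand-rolled breadth-first search.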
Presentation slides (PDF)
Video of presentation (mp4)
Bio: Amol Deshpande is a Professor in the Department of Computer Science at the University of Maryland with a joint appointment in the University of Maryland Institute for Advanced Computer Studies (UMIACS). He received his Ph.D. from the University of California at Berkeley in 2004. His research interests include uncertain data management, adaptive query processing, data streams, graph analytics, and sensor networks. He is a recipient of an NSF CAREER award, and has received best paper awards at the VLDB 2004, EWSN 2008, and VLDB 2009 conferences.