Data Systems Seminar Series (2016-2017)

The Data Systems Seminar Series provides a forum for presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks that are scheduled for this year are listed below.

The talks are usually held on a Monday at 10:30 am in room DC 1302. Exceptions are flagged.

We will try to post the presentation notes whenever possible. Please click on the presentation title to access these notes.


The Database Seminar Series is supported by


Paul Larson
Olga Papaemmanouil
Wolfgang Gatterbauer
Amol Deshpande
Fabian M. Suchanek
Felix Naumann
Peter Bailis
Patrick Valduriez
C. Mohan
Hans-Peter Kriegel

26 September 2016, 10:15 am, DC 1302 (Please note unusual start time)

Title: Database Systems Meet Non-Volatile Memory (NVRAM)
notes
video
Speaker: Per-Åke (Paul) Larson
Abstract: Byte-addressable, non-volatile memory (NVRAM) with close to DRAM speed is becoming a reality. Low-capacity DIMMs (10s of MBs) are already available and high-capacity DIMMs (100s of GB) are expected in 2017. This talk is about how database systems, in particular main-memory databases, can benefit from NVRAM. It will begin with an outline of the characteristics of different types of NVRAM and how the operating system manages NVRAM and provides applications with access to it. Ensuring that data structures such as indexes in NVRAM can be recovered to a consistent state without data or memory loss after a crash is challenging. The talk will discuss what causes the difficulties and how they can be overcome. It will then show how NVRAM can be used to greatly reduce the latency of commit processing and replication. By storing a main-memory database, including indexes, in NVRAM, it is possible to achieve near-instant recovery after a crash. The final part of the talk will discuss how this can be achieved.
Bio: Paul has conducted research in the database field for over 35 years. He served as a Professor in the Department of Computer Science at the University of Waterloo for 15 years and as a Principal Researcher at Microsoft Research for close to 20 years. Paul is a Fellow of the ACM. He has worked in a variety of areas: file structures, materialized views, query processing, query optimization, column stores, and main-memory databases among others. Paul collaborated closely with the SQL Server team to drastically improve SQL Server performance by adding column store indexes, a novel main-memory engine (Hekaton), and support for real-time analytics.

3 October 2016, 10:15 am, DC 1302 (Please note unusual start time)

Title: Performance Management for Cloud Databases via Machine Learning 
notes
video
Speaker: Olga Papaemmanouil, Brandeis University
Abstract: Cloud computing has become one of the most active areas of computer science research, in large part because it allows computing to behave like a general utility that is always available on demand. While existing cloud infrastructures and services significantly reduce application development time, cloud users must still invest significant effort, because application deployment often involves a number of challenges, including but not limited to performance monitoring, resource provisioning, and workload allocation. These tasks depend strongly on application-specific workload characteristics and performance objectives, so their implementation burden is left to the application developers.

We argue for a substantial shift away from human-crafted solutions and towards leveraging machine learning algorithms to address the above challenges. These algorithms can be trained on application-specific properties and customized performance goals to automatically learn how to provision resources as well as schedule the execution of incoming query workloads. Towards this vision, we have developed WiSeDB, a learning-based performance management service for cloud-deployed data management applications. In this talk, I will discuss how WiSeDB uses (a) supervised learning to automatically learn cost-effective models for guiding query placement, scheduling, and resource provisioning decisions for batch workload processing, and (b) reinforcement learning to naturally adapt to changes in query arrival rates and dynamic resource availability, while being decoupled from notoriously inaccurate performance prediction models.

Bio: Olga Papaemmanouil has been an Assistant Professor in the Department of Computer Science at Brandeis University since January 2009. Her research interests lie in the area of data management, with a recent focus on cloud databases, data exploration, query optimization, and query performance prediction. She received her undergraduate degree in Computer Science and Informatics at the University of Patras, Greece in 1999. In 2001, she received her Sc.M. in Information Systems at the University of Economics and Business, Athens, Greece. She then joined the Computer Science Department at Brown University, where she completed her Ph.D. in Computer Science in 2008. She is the recipient of an NSF Career Award (2013) and a Paris Kanellakis Fellowship from Brown University (2002).

25 October 2016, 1:00 pm, DC 2585 (Please note unusual start time and location)

Title: Approximate lifted inference with probabilistic databases

Speaker: Wolfgang Gatterbauer, CMU
Abstract:

Performing inference over large uncertain data sets is becoming a central data management problem. Recent large knowledge bases, such as Yago, Nell or DeepDive, have millions to billions of uncertain tuples. Because general reasoning under uncertainty is highly intractable, many state-of-the-art systems today perform approximate inference by reverting to sampling. This talk shows an alternative approach that allows approximate ranking of answers to hard probabilistic queries in guaranteed polynomial time, using only basic operators of existing database management systems (i.e., no sampling required).

(1) The first part of this talk develops upper and lower bounds for the probability of Boolean functions by treating multiple occurrences of variables as independent and assigning them new individual probabilities. We call this approach dissociation and give an exact characterization of optimal oblivious bounds (i.e., when the new probabilities are chosen independently of the probabilities of all other variables). Our new bounds shed light on the connection between previous relaxation-based and model-based approximations and unify them as concrete choices in a larger design space.

(2) The second part then draws the connection to lifted inference and shows how the problem of approximate probabilistic inference can be entirely reduced to a standard query evaluation problem with aggregates. There are no iterations and no exponential blow-ups. All benefits of relational engines (such as cost-based optimizations, multi-core query processing, shared-nothing parallelization) are directly available to queries over probabilistic databases. To achieve this, we compute approximate rather than exact probabilities, with a one-sided guarantee: The probabilities are guaranteed to be upper bounds to the true probabilities, which we show is sufficient to rank the top query answers with high precision. We give experimental evidence on synthetic TPC-H data that this approach can be orders of magnitude faster and also more accurate than sampling-based approaches.

(Based on joint work with Dan Suciu from TODS 2014, VLDB 2015, and VLDBJ 2016: http://arxiv.org/pdf/1409.6052, http://arxiv.org/pdf/1412.1069, http://arxiv.org/pdf/1310.6257)
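As a toy illustration of the dissociation idea in part (1), the following Python sketch compares the exact probability of a small Boolean formula with the probability obtained after treating the two occurrences of a variable as independent. The formula, the variable names (x, y, z, x1, x2), the probabilities, and the helper exact_probability are all assumptions made up for this example; they are not taken from the talk or the cited papers. The sketch enumerates all possible worlds, which is exponential and only serves to check the bound on a tiny input, whereas the approach described in the abstract evaluates such bounds with standard relational operators.

```python
from itertools import product


def exact_probability(formula, probs):
    """Exact P(formula) by enumerating all assignments of independent Boolean variables."""
    names = sorted(probs)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        world = dict(zip(names, values))
        if formula(world):
            # Probability of this world under variable independence.
            weight = 1.0
            for name in names:
                weight *= probs[name] if world[name] else 1.0 - probs[name]
            total += weight
    return total


def original(w):
    # x occurs twice: (x AND y) OR (x AND z)
    return (w["x"] and w["y"]) or (w["x"] and w["z"])


def dissociated(w):
    # Each occurrence of x is replaced by its own independent copy (x1, x2).
    return (w["x1"] and w["y"]) or (w["x2"] and w["z"])


probs = {"x": 0.5, "y": 0.5, "z": 0.5}
# Both copies keep the original probability of x; for this disjunctive case
# that choice yields an upper bound on the exact probability.
dissociated_probs = {"x1": 0.5, "x2": 0.5, "y": 0.5, "z": 0.5}

exact = exact_probability(original, probs)                  # 0.375
upper = exact_probability(dissociated, dissociated_probs)   # 0.4375
print(exact, upper)
```

In this example the dissociated formula has no repeated variables, so its probability could also be computed directly as 1 - (1 - 0.5 * 0.5) * (1 - 0.5 * 0.5) without any enumeration.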

Bio: Wolfgang Gatterbauer is an Assistant Professor in the Tepper School of Business and, by courtesy, in the Computer Science Department of Carnegie Mellon University. His current research focuses on scalable approaches to performing inference over uncertain data and is supported by a Career award from the National Science Foundation. Prior to joining CMU, he was a Post-Doc in the Database group at University of Washington. In earlier times, he won a Bronze medal at the International Physics Olympiad, worked in the steam turbine development department of ABB Alstom Power, and in the German office of McKinsey & Company.

2 December 2016, 10:30 am, DC 1304 (Please note different location)

Title: Scalable Platforms for Graph Analytics and Collaborative Data Science 
notes
video
Speaker: Amol Deshpande, University of Maryland
Abstract:

For several decades now, the amount of data available to us has been growing at a pace far higher than our ability to process it; this trend, popularly referred to as "big data", has accelerated many-fold in recent years with the emergence of efficient and mass-produced scientific instruments, increasing ease of generating and publishing data, and proliferation of Internet-connected devices. In this talk, I will present an overview of two recent projects from my group at UMD on building scalable platforms for large-scale data analytics.

First, I will discuss our ongoing work on building a platform, called "DataHub", for enabling collaborative data science, where teams of data scientists can simultaneously analyze, modify, and share datasets, to understand trends and to extract actionable insights. While numerous solutions exist for specific data analysis tasks, underlying infrastructure and data management capabilities for supporting ad hoc collaboration pipelines are still largely missing. I will present our vision for a unified, dataset-centric platform for addressing these challenges, and present our recent work on: (a) efficiently managing a large number of versioned datasets, (b) designing and supporting a unified query language to seamlessly query versioning and provenance information, and (c) lifecycle management of complex machine learning models like deep neural networks.

Second, I will present our initial work on extracting hidden graphs from relational databases. Although there has been much work on large-scale graph analytics, graphs are not the primary representation choice for most data today, and users who want to employ graph analytics are forced to extract data from their data stores, construct the requisite graphs, and then use a specialized engine to write and execute their graph analysis tasks. I will describe our work on a system called GraphGen that enables users to declaratively specify graph extraction tasks over relational databases, visually explore the extracted graphs, and write and execute graph algorithms over them, either directly or using existing graph libraries like the widely used NetworkX Python library.

Bio: Amol Deshpande is a Professor in the Department of Computer Science at the University of Maryland with a joint appointment in the University of Maryland Institute for Advanced Computer Studies (UMIACS). He received his Ph.D. from University of California at Berkeley in 2004. His research interests include uncertain data management, adaptive query processing, data streams, graph analytics, and sensor networks. He is a recipient of an NSF Career award, and has received best paper awards at the VLDB 2004, EWSN 2008, and VLDB 2009 conferences.

19 December 2016, 10:30 am, DC 1304 (Please note different location)

Title: A Hitchhiker's Guide to Ontology

notes
video
Speaker: Fabian M. Suchanek, Telecom ParisTech University
Abstract:

In this talk, I will give an overview of our recent work in the area of knowledge bases. I will first talk about our main project, the YAGO knowledge base. YAGO is now multilingual, and has grown into a larger project at the Max Planck Institute for Informatics and Télécom ParisTech. I will then talk about rule mining. We can find semantic correlations in the form of Horn rules in the knowledge base. In our newest work, we show how rule mining can be applied to predict the completeness or incompleteness of the data in the knowledge base. I will also talk about watermarking approaches to trace the provenance of ontological data. Finally, I will showcase our work on creativity in knowledge bases.

Bio: Fabian M. Suchanek is an associate professor at the Telecom ParisTech University in Paris. He obtained his PhD at the Max-Planck Institute for Informatics under the supervision of Gerhard Weikum. In his thesis, Fabian developed inter alia the YAGO-Ontology, one of the largest public ontologies, which earned him an honorable mention of the SIGMOD dissertation award. Fabian was a postdoc at Microsoft Research in Silicon Valley (reporting to Rakesh Agrawal) and at INRIA Saclay/France (reporting to Serge Abiteboul). He continued as the leader of the Otto Hahn Research Group "Ontologies" at the Max-Planck Institute for Informatics in Germany. Since 2013, he has been an associate professor at Télécom ParisTech University in Paris. Fabian teaches classes on the Semantic Web, Information Extraction and Knowledge Representation in France, in Germany, and in Senegal. With his students, he works on information extraction, rule mining, ontology matching, and other topics related to large knowledge bases. He has published around 50 scientific articles, among others at ISWC, VLDB, SIGMOD, WWW, CIKM, ICDE, and SIGIR, and his work has been cited more than 5500 times.

9 January 2017, 10:30 am, DC 1302

Title: Data Profiling

notes
video
Speaker: Felix Naumann, Hasso-Plattner-Institut für Softwaresystemtechnik
Abstract:

Data profiling comprises a broad range of methods to efficiently analyze a given data set. In a typical scenario, which mirrors the capabilities of commercial data profiling tools, tables of a relational database are scanned to derive metadata, such as data types and value patterns, completeness and uniqueness of columns, keys and foreign keys, and occasionally functional dependencies and association rules. Individual research projects have proposed several additional profiling tasks, such as the discovery of inclusion dependencies or conditional functional dependencies.

Data profiling deserves a fresh look for three reasons: First, the area itself is neither established nor defined in any principled way, despite significant research activity on individual parts in the past. Second, current data profiling techniques hardly scale beyond what can only be called small data. Third, more and more data beyond the traditional relational databases are being created and beg to be profiled. The talk highlights the state of the art and proposes new research directions and challenges.
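To make the basic profiling tasks named in the abstract concrete (completeness and uniqueness of columns, and functional dependencies), here is a minimal Python sketch over a tiny in-memory table. The function names profile_column and holds_fd, the column names, and the sample rows are invented for this illustration and do not come from the talk or from any profiling tool. The sketch also assumes that a missing value (None) is treated as an ordinary value when checking a dependency, which is only one of several possible conventions.

```python
def profile_column(rows, column):
    """Single-column metadata: completeness, distinct count, and uniqueness."""
    values = [row[column] for row in rows]
    non_null = [v for v in values if v is not None]
    return {
        "completeness": len(non_null) / len(values),
        "distinct": len(set(non_null)),
        # Unique here means: no missing values and no repeated values.
        "is_unique": len(set(non_null)) == len(values),
    }


def holds_fd(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds on the given rows."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = tuple(row[c] for c in rhs)
        # The first row with this key fixes the expected right-hand side.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Made-up sample data for illustration only.
rows = [
    {"id": 1, "zip": "14482", "city": "Potsdam"},
    {"id": 2, "zip": "14482", "city": "Potsdam"},
    {"id": 3, "zip": "10115", "city": "Berlin"},
    {"id": 4, "zip": None, "city": "Berlin"},
]

id_profile = profile_column(rows, "id")
zip_profile = profile_column(rows, "zip")
zip_determines_city = holds_fd(rows, ["zip"], ["city"])
city_determines_zip = holds_fd(rows, ["city"], ["zip"])
print(id_profile, zip_profile, zip_determines_city, city_determines_zip)
```

Checking one candidate dependency is a single pass over the table; the scaling problem mentioned in the abstract arises because the number of candidate column combinations grows exponentially with the number of columns.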

Bio: Felix Naumann studied mathematics, economics, and computer science at the University of Technology in Berlin. After receiving his diploma (MA) in 1997 he joined the graduate school "Distributed Information Systems" at Humboldt University of Berlin. He completed his PhD thesis on "Quality-driven Query Answering" in 2000. In 2001 and 2002 he worked at the IBM Almaden Research Center on topics of data integration. From 2003 to 2006 he was assistant professor for information integration at the Humboldt-University of Berlin. Since then he has held the chair for information systems at the Hasso Plattner Institute at the University of Potsdam in Germany. His research interests are in data profiling, data cleansing, and text mining.

1 May 2017, 10:30 am, DC 1302

Title: MacroBase: Prioritizing Attention in Fast Data
Speaker: Peter Bailis, Stanford University
Abstract: While data volumes continue to rise, the capacity of human attention remains limited. As a result, users need analytics engines that can assist in prioritizing attention in this "fast data" that is too large for manual inspection. We are developing MacroBase, a new data analytics engine designed to prioritize attention in fast data streams. MacroBase identifies deviations within streams and generates potential explanations that help contextualize and summarize relevant behaviors. As the first engine to combine streaming classification and streaming explanation operators, MacroBase exploits cross-layer optimizations that deliver order-of-magnitude speedups over existing alternatives while allowing flexible operation across domains including sensor, video, and relational data via extensible feature transform operators. As a result, MacroBase can deliver accurate results at speeds of up to 2M events per second per query on a single core, with operators for flexible operation over time-series, video, and relational data. MacroBase is a core component of the Stanford DAWN project, a new research initiative designed to enable more usable and efficient machine learning infrastructure.
Bio: Peter Bailis is an assistant professor of Computer Science at Stanford University. Peter's research in the Future Data Systems group focuses on the design and implementation of next-generation, post-database data-intensive systems. His work spans large-scale data management, distributed protocol design, and architectures for high-volume complex decision support. He is the recipient of an NSF Graduate Research Fellowship, a Berkeley Fellowship for Graduate Study, best-of-conference citations for research appearing in both SIGMOD and VLDB, and the CRA Outstanding Undergraduate Researcher Award. He received a Ph.D. from UC Berkeley in 2015 and an A.B. from Harvard College in 2011, both in Computer Science.

2 May 2017, 10:30 am, DC 1302

Title: The CloudMdsQL Multistore System
notes
video
Speaker: Patrick Valduriez, Inria and Biology Computational Institute (IBC)
Abstract:

The blooming of different cloud data management infrastructures, specialized for different kinds of data and tasks, has led to a wide diversification of DBMS interfaces and the loss of a common programming paradigm. In this talk, we present the design of a Cloud Multidatastore Query Language (CloudMdsQL), and its query engine. CloudMdsQL is a functional SQL-like language, capable of querying multiple heterogeneous data stores (relational and NoSQL) within a single query that may contain embedded invocations to each data store’s native query interface. The query engine has a fully distributed architecture, which provides important opportunities for optimization. The major innovation is that a CloudMdsQL query can exploit the full power of local data stores, by simply allowing some local data store native queries (e.g. a breadth-first search query against a graph database) to be called as functions, and at the same time be optimized, e.g. by pushing down select predicates, using bind join, performing join ordering, or planning intermediate data shipping. Our experimental validation, with various data stores (graph, document, relational, Spark/HDFS), and representative queries, shows that CloudMdsQL satisfies the five important requirements for a cloud multidatastore query language.

This work is partially funded by the European Commission under the Integrated Project CoherentPaaS.

Bio: Patrick Valduriez is a senior researcher at Inria and LIRMM, University of Montpellier, France. He has also been a professor of Computer Science at University Paris 6 and a researcher at Microelectronics and Computer Technology Corp. in Austin, Texas. He received his Ph.D. degree and Doctorat d'Etat in CS from University Paris 6 in 1981 and 1985, respectively. He is the head of the Zenith team (between Inria and University of Montpellier, LIRMM) that focuses on data management in large-scale distributed and parallel systems (P2P, cluster, grid, cloud), in particular, scientific data management. He has authored and co-authored over 250 technical papers and several textbooks, including “Principles of Distributed Database Systems”. He currently serves as associate editor of several journals, including the VLDB Journal, Distributed and Parallel Databases, and Internet and Databases. He has served as PC chair of major conferences such as SIGMOD and VLDB. He was the general chair of SIGMOD04, EDBT08 and VLDB09. He obtained the best paper award at VLDB00. He was the recipient of the 1993 IBM scientific prize in Computer Science in France and the 2014 Innovation Award from Inria – French Academy of Science – Dassault Systems. He is an ACM Fellow.

6 July 2017, 10:30 am, DC 1302

Title: New Era in Distributed Computing with Blockchains and Databases
notes
video
Speaker: C. Mohan, IBM Almaden Research Center
Abstract:

A new era is emerging in the world of distributed computing with the growing popularity of blockchains (shared, replicated and distributed ledgers) and the associated databases as a way of integrating inter-organizational work. Originally, the concept of a distributed ledger was invented as the underlying technology of the cryptocurrency Bitcoin. But the adoption and further adaptation of it for use in the commercial or permissioned environments is what is of utmost interest to me and hence will be the focus of this keynote. Computer companies like IBM and Microsoft, and many key players in different vertical industry segments have recognized the applicability of blockchains in environments other than cryptocurrencies. IBM did some pioneering work by architecting and implementing Fabric, and then open sourcing it. Now Fabric is being enhanced via the Hyperledger Consortium as part of The Linux Foundation. A few of the other efforts include Enterprise Ethereum, R3 Corda and BigchainDB.

While there is no standard in the blockchain space currently, all the ongoing efforts involve some combination of database, transaction, encryption, consensus and other distributed systems technologies. Some of the application areas in which blockchain pilots are being carried out are: smart contracts, supply chain management, know your customer, derivatives processing and provenance management. In this talk, I will survey some of the ongoing blockchain projects with respect to their architectures in general and their approaches to some specific technical areas. I will focus on how the functionality of traditional and modern data stores is being utilized or not utilized in the different blockchain projects. I will also distinguish how traditional distributed database management systems have handled replication and how blockchain systems do it. Since most of the blockchain efforts are still in a nascent state, the time is right for database and other distributed systems researchers and practitioners to get more deeply involved to focus on the numerous open problems.

This talk was delivered as the opening keynote at the 37th IEEE International Conference on Distributed Computing Systems (ICDCS) in Atlanta (USA) on 6 June 2017. Extensive related blockchain collateral can be found at http://bit.ly/CMbcDB

Bio:

Dr. C. Mohan has been an IBM researcher for 35 years in the database area, impacting numerous IBM and non-IBM products, the research and academic communities, and standards, especially with his invention of the ARIES family of database locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM (1997) and ACM/IEEE (2002) Fellow also served as the IBM India Chief Scientist for 3 years (2006-2009).

In addition to receiving the ACM SIGMOD Innovations Award (1996), the VLDB 10 Year Best Paper Award (1999) and numerous IBM awards, Mohan was elected to the US and Indian National Academies of Engineering (2009), and was named an IBM Master Inventor (1997). This Distinguished Alumnus of IIT Madras (1977) received his PhD at the University of Texas at Austin (1981). He is an inventor of 50 patents. He is currently focused on Blockchain, Big Data and HTAP technologies (http://bit.ly/CMbcDB, http://bit.ly/CMgMDS). Since 2016, he has been a Distinguished Visiting Professor of China’s prestigious Tsinghua University. He has served on the advisory board of IEEE Spectrum, and on numerous conference and journal boards. Mohan is a frequent speaker in North America, Europe and India, and has given talks in 40 countries. He is very active on social media and has a huge network of followers.

More information can be found on the Wikipedia page at http://bit.ly/CMwIkP

2 August 2017, 10:30 am, DC 1302

Title: My Journey to Data Mining 
notes
video
Speaker: Hans-Peter Kriegel, Ludwig-Maximilians-Universität München
Abstract: In this talk, I will describe how I came, as a database researcher (which I have not given up), to the field of data mining. The talk will present the density-based principle and its application to clustering as well as outlier and trend detection, including high-dimensional data. At the end of the talk I will give a short outlook on the art of runtime evaluation. Are we comparing algorithms or implementations?
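The abstract does not name a specific algorithm, so the following Python sketch is only an assumption about what the density-based principle looks like in practice. It follows the general shape of DBSCAN, a well-known density-based clustering algorithm: a point with at least min_pts points within distance eps is a core point, clusters grow by following core points, and points reachable from no core point are labeled as noise. The parameter names eps and min_pts, the helper region_query, and the sample points are choices made for this illustration. The neighborhood search is a plain linear scan, so this version is quadratic and not meant for large data.

```python
def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i itself)."""
    xi, yi = points[i]
    return [j for j, (xj, yj) in enumerate(points)
            if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps ** 2]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        frontier = list(neighbors)
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable, but not a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                frontier.extend(j_neighbors)  # j is a core point, keep expanding
    return labels


# Two dense groups of four points each and one isolated point (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

On this input the two groups receive cluster ids 0 and 1, and the isolated point (5, 5) is labeled -1, which also hints at how the same principle supports outlier detection.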
Bio:

Hans-Peter Kriegel has been a Professor of Informatics at Ludwig-Maximilians-Universität München, Germany, since 1991. As of April 2014 he is a Professor Emeritus due to mandatory retirement in Germany, but still has a contract with the university for research. He has published on a wide range of data mining topics including clustering, outlier detection and high-dimensional data analysis.

Since the beginning of his research career, he has worked on spatial data management and similarity search, now in particular on searching and mining uncertain data, including uncertain spatio-temporal data. In 2009 the Association for Computing Machinery (ACM) elected Professor Kriegel an ACM Fellow for his contributions to knowledge discovery and data mining, similarity search, spatial data management, and access methods for high-dimensional data. So far, his more than 450 publications have been cited approximately 50,000 times according to Google Scholar. He has received both international research awards in data mining and knowledge discovery: the 2013 IEEE ICDM Research Contributions Award and the 2015 ACM SIGKDD Innovation Award.