Database Seminar Series (2014-2015)

The Database Seminar Series provides a forum for presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks that are scheduled for this year are listed below.

The talks are usually held on a Wednesday at 2:30 pm in room DC (Davis Centre) 1302.

We will try to post the presentation notes whenever possible. Please click on the presentation title to access these notes.


The Database Seminar Series is supported by


Frank Dehne
Jignesh Patel
Nesime Tatbul
Cristiana Amza
Aaron Elmore
Wang-Chiew Tan
Sudipto Das
Wolfgang Lehner

24 September 2014, 2:30 pm, DC 1302

Title: Real-Time On-line Analytical Processing (OLAP) On Multi-Core and Cloud Architectures (PDF)
Speaker: Frank Dehne, Carleton University
Abstract:

In contrast to queries for on-line transaction processing (OLTP) systems that typically access only a small portion of a database, on-line analytical processing (OLAP) queries may need to aggregate large portions of a database, which often leads to performance issues. We have built multi-core and cloud-based real-time OLAP systems utilizing a new distributed index structure for OLAP, termed the distributed PDCR tree. Our system supports multiple dimension hierarchies and efficient query processing on elaborate dimension hierarchies, which are central to OLAP systems. It is particularly efficient for complex OLAP queries that need to aggregate large portions of a data warehouse. Our project is partially funded by IBM, and our system has won an "IBM Innovation Impact of the Year" award.

Bio: Frank Dehne received an MCS (Dipl. Inform.) from RWTH Aachen University, Germany and a PhD from the University of Wuerzburg, Germany. He is currently Chancellor's Professor of Computer Science at Carleton University in Ottawa, Canada. His research program is focused on improving the performance of big data analytics systems, in particular for business intelligence and computational biochemistry, through efficient parallel computing methods for multi-core processors, GPUs, processor clusters and clouds. He is serving or has served on the Editorial Boards of IEEE Transactions on Computers, Information Processing Letters, Journal of Bioinformatics Research and Applications, and Int. Journal of Data Warehousing and Mining. He is a member and former vice-chair of the IEEE Technical Committee on Parallel Processing, and a member of the ACM Symposium on Parallel Algorithms & Architectures Steering Committee. He has been a Fellow of the IBM Centre For Advanced Studies Canada (Business Intelligence and Business Analytics section) since 2010.

15 October 2014, 2:30 pm, DC 1302

Title: Towards hardware-software co-design for data analytics: A plea and a proposal (PDF) video
Speaker: Jignesh Patel, University of Wisconsin
Abstract: Traditionally, data processing kernels have played a catch-up game to the changes that hardware folks (a.k.a. architects) make. In the mid-90s, we (a.k.a. the database folks) realized that architects had added processor caches, so we started rewriting data processing kernels to make better use of caches. Then, we realized that there is this thing called the TLB, misses to which are outrageously expensive. So, we went and fixed our data processing software to make better use of TLBs. In the earlier part of this century, we realized that processors had made tremendous advances in micro-architecture with features like out-of-order execution, so we reacted to that. We now find that in some cases we need to throw away changes that we made in reaction to previous architectural events, and go back to the drawing board. This game of waiting for architects to give us new architectural features and then us in the database world reacting to them, repeated in an endless cycle, is wasteful and unsustainable given the rapid changes that are happening at the architecture and the database levels. Can we find a more synergistic and collaborative way of co-charting the future? In this talk, I will describe the approach that we are taking to answer this question in the Quickstep project. To help move forward, I will also present two simple data processing kernels that are likely critical to power analytical data processing workloads. Can the architecture and database communities get together to co-design hardware and software artifacts to produce predictable and scalable performance on these two kernels, while minimizing the energy consumed when running these kernels? I'll argue that addressing this question can lead to a fundamental shift in moving towards a true hardware-software co-design paradigm for data processing applications.
Bio: Jignesh Patel is a Professor in Computer Sciences at the University of Wisconsin-Madison, where he also earned his Ph.D. He has worked in the area of databases (now fashionably called "big data") for over two decades. He has won several best paper awards, and industry research awards. He is the recipient of the Wisconsin COW teaching award, and the U. Michigan College of Engineering Education Excellence Award. He has a strong interest in seeing research ideas transition to actual products. His Ph.D. thesis work was acquired by NCR/Teradata in 1997, and he also co-founded Locomatix — a startup that built a platform to power real-time data-driven mobile services. Locomatix became part of Twitter in 2013. He is an ACM Distinguished Scientist and an IEEE Senior Member. He also serves on the board of Lands' End, and advises a number of startups. He blogs at http://bigfastdata.blogspot.com.

19 November 2014, 2:30 pm, DC 1304 (please note alternate location)

Title: S-Store: A Streaming NewSQL System for Big Velocity Applications (Cancelled)

Speaker: Nesime Tatbul, Intel Labs and MIT
Abstract: Managing high-speed data streams in real time has become an integral part of today’s big data applications. In a significant portion of these applications (e.g., leaderboard maintenance, online advertising), we see a critical need for real-time stream processing to co-exist with transactional state management. Yet, existing systems treat streaming and transaction processing as two separate computational paradigms, which makes it difficult to build such applications so that they execute correctly and scalably. S-Store is a new data management system that provides a single, scalable platform for transactional stream processing. S-Store takes its architectural foundation from H-Store, a modern distributed main-memory OLTP/NewSQL system, and adds well-defined primitives to support data-driven processing such as streams, windows, triggers, and workflows. Furthermore, it makes a number of careful extensions to H-Store's traditional transaction model in order to maintain ACID guarantees in the presence of data and processing dependencies among transaction executions that involve streams. In this talk, I will present S-Store's design and implementation, and show how S-Store can ensure transactional integrity without sacrificing performance.
Bio: Nesime Tatbul is a senior research scientist at the Intel Science and Technology Center for Big Data based at MIT CSAIL. Before joining Intel Labs, she was a faculty member at the Computer Science Department of ETH Zurich. She received her B.S. and M.S. degrees in Computer Engineering from the Middle East Technical University (METU), and her M.S. and Ph.D. degrees in Computer Science from Brown University. During her graduate school years at Brown, she also worked as a research intern at the IBM Almaden Research Center, and as a consultant for the U.S. Army Research Institute of Environmental Medicine (USARIEM). Her research interests are in database systems, with a recent focus on data stream processing and distributed data management. She is the recipient of an IBM Faculty Award in 2008, a Best System Demonstration Award at the ACM SIGMOD 2005 Conference, and both the Best Poster Award and the Grand Challenge Award at the ACM DEBS 2011 Conference. She has served on the program committee for various conferences including ACM SIGMOD (as an industrial program co-chair in 2014 and as a group leader in 2011), VLDB, and IEEE ICDE (as a PC track chair for Streams, Sensor Networks, and Complex Event Processing in 2013). She has chaired a number of VLDB co-located workshops including the International Workshop on Data Management for Sensor Networks (DMSN) and the International Workshop on Business Intelligence for the Real-Time Enterprise (BIRTE). Her recent editorial duties include PVLDB (associate editor, Volume 5, 2011-2012) and ACM SIGMOD Record (associate editor, Research Surveys Column, since June 2012).

21 January 2015, 2:30 pm, DC 1302

Title: Stage-Aware Anomaly Detection through Execution Flow Tracking (PDF) video
Speaker: Cristiana Amza, University of Toronto
Abstract:

Modern Cloud and Data Center environments are based on large-scale distributed storage systems. Diagnosing configuration errors, software bugs and performance anomalies in such systems has become a major problem for large Web hosting sites.

As part of a larger project, which endeavors to design and prototype interactive, guided modelling for such systems, I will introduce Stage-aware Anomaly Detection (SAAD), a low-overhead, real-time solution for detecting runtime anomalies in storage systems. SAAD is based on the key observation that most state-of-the-art storage server architectures are multi-threaded and structured as a set of modules, which we call stages.

SAAD leverages this observation to collect stage-level log summaries at runtime and to perform statistical analysis across stage instances. Stages that generate rare execution flows and/or register unusually high duration for regular flows at runtime indicate anomalies. SAAD makes two key contributions: i) it limits the search space for root causes by pinpointing specific anomalous code stages, and ii) it reduces compute and storage requirements for log analysis, while preserving accuracy, through a novel technique based on log summarization.

We evaluated SAAD on three distributed storage systems: HBase, Hadoop Distributed File System (HDFS), and Cassandra. We show that, with practically zero overhead, we uncover various anomalies in real time.

Bio: Cristiana Amza received her B.S. degree in Computer Engineering from Bucharest Polytechnic Institute in 1991, and her M.S. and Ph.D. degrees in Computer Science from Rice University in 1997 and 2003, respectively. Her research interests are in the area of distributed and parallel systems, with an emphasis on designing, prototyping and experimentally evaluating novel algorithms and tools for self-managing, self-adaptive and self-healing behavior in data centers and Clouds. She joined the Department of Electrical and Computer Engineering at the University of Toronto in October 2003 as an Assistant Professor and became an Associate Professor in July 2009. She is actively collaborating with several industry partners, including Intel, NetApp, Bell Canada, and IBM through IBM T.J. Watson, Almaden and IBM Toronto Labs.

1 April 2015, 2:30 pm, DC 1302

Title: Building an Elastic Main-Memory Database: E-Store (PDF)
Speaker: Aaron Elmore, MIT and University of Chicago
Abstract: On-line transaction processing (OLTP) database management systems (DBMSs) often serve time-varying workloads due to daily, weekly or seasonal fluctuations in demand, or because of rapid growth in demand due to a company’s business success. In addition, many OLTP workloads are heavily skewed to “hot” tuples or ranges of tuples. To deal with such fluctuations, an OLTP DBMS needs to be elastic; that is, it must be able to expand and contract resources in response to load fluctuations and dynamically balance load as hot tuples vary over time. In this talk, I will present E-Store, an elastic partitioning framework for distributed main-memory OLTP DBMSs. E-Store automatically scales resources in response to demand spikes, periodic events, and gradual changes in an application’s workload. E-Store addresses localized bottlenecks through a two-tier data placement strategy. This talk will also present Squall, the technique E-Store utilizes to update the data layout without bringing the system off-line. As E-Store is built upon an architecture that relies on a partitioned single-threaded execution model, Squall takes great care to balance data migration with transaction performance. Lastly, this talk will discuss on-going work and future challenges in building an elastic database.
Bio: Aaron J. Elmore is an Assistant Professor in the Department of Computer Science and the College at the University of Chicago, starting June 2015. Aaron is currently a Postdoctoral Associate at MIT working with Mike Stonebraker on elastic and multitenant database systems, and with Sam Madden on the collaborative data analytics platform DataHub. Aaron's thesis on Elasticity Primitives for Database-as-a-Service was completed at the University of California, Santa Barbara under the supervision of Divy Agrawal and Amr El Abbadi.

8 April 2015, 2:00 pm, DC 1302 (Please note the earlier time)

Title: Data Cleaning, Linking, and Integration: From Raw Data to Actionable Information
Speaker: Wang-Chiew Tan, University of California Santa Cruz
Abstract: Management and policy decisions are typically made based on information that is derived from datasets. Preparing datasets so that they are ready to be analyzed requires a number of tedious and, oftentimes, manual data curation activities, such as data cleaning, linking, and integration with other datasets.

In this talk, I will present some recent work on these three activities, with particular emphasis on recent work in which we propose a query-oriented system for cleaning data with oracles. Unlike prior data cleaning techniques where the focus has largely been to correct all existing data upfront, our framework is driven by the correctness of query results, cleans data only as needed, and also permits one to augment the underlying dataset through the identification of missing tuples in the result of a query. Incorrect tuples are removed from, and missing tuples are added to, the result of a query through edits that are applied to the underlying dataset, where the edits are derived by interacting with domain experts, whom we model as oracle crowds.

We show that the problem of determining minimal interactions with oracle crowds to derive database edits that remove incorrect tuples from, or add missing tuples to, the result of a query is NP-hard in general, and present heuristic algorithms that interact with oracle crowds to progressively clean the dataset, as needed. I will also present recent work on temporal record linkage and integration that allows one to identify which facts are temporally related and how they can be meaningfully combined.

Bio: Wang-Chiew Tan is a Professor of Computer Science at the University of California, Santa Cruz. She received her B.Sc. (First Class) in Computer Science from the National University of Singapore and her Ph.D. in Computer Science from the University of Pennsylvania. Her research interests are in the general area of data management, with emphasis on topics such as (big) data integration and exchange, data provenance, crowdsourcing, and scientific databases. She is the recipient of an NSF CAREER award, a Google Faculty Award, and an IBM Faculty Award. She is the co-author of four best papers and a co-recipient of the 2014 ACM PODS Alberto O. Mendelzon Test-of-Time Award, and several of her publications have been invited to and appeared in special issues for selected papers. She has consistently served on the program committees of top database conferences. She was the program committee chair of the International Conference on Database Theory (ICDT) 2013, she is currently on the VLDB Board of Trustees, and she is the 2016 ACM Principles of Database Systems (PODS) program committee chair.

6 May 2015, 2:30 pm, DC 1302

Title: Performance Isolation in Multi-Tenant Relational Database-as-a-Service (PDF)
Speaker: Sudipto Das, Microsoft Research
Abstract:

Multi-tenancy and resource sharing are essential to make a Relational Database-as-a-Service (DaaS), such as Azure SQL Database, cost-effective. However, one major consequence of resource sharing is that the performance of one tenant's workload can be significantly affected by the resource demands of co-located tenants. In the SQLVM project at Microsoft Research, our approach to performance isolation in a DaaS is to isolate the key resources, such as CPU, I/O and memory, needed by the tenants' workloads. The major challenge is in supporting this abstraction within an RDBMS for a wide variety of workloads and demands, without statically allocating resources, while ensuring low overheads and scaling to large numbers of tenants. Mechanisms designed in the SQLVM project are now in production and form an integral part of the Azure SQL Database Service Tiers and Performance Levels made generally available in September 2014.

More information about the project can be found at: http://research.microsoft.com/en-us/projects/sqlvm/.

Bio: Sudipto Das is a Researcher in the Data Management, Mining, and Exploration (DMX) group at Microsoft Research (MSR). He received his Ph.D. in Computer Science from the University of California Santa Barbara (UCSB). His research interests are in the broad area of scalable, distributed, and multi-tenant DBMSs for cloud platforms. His dissertation work was in the area of building scalable and elastic transactional data stores, for which he received the 2013 ACM SIGMOD Jim Gray Doctoral Dissertation Award and UCSB's 2012 Lancaster Dissertation award. Dr. Das is also the recipient of the CIDR 2011 Best Paper Award and MDM 2011 Best Runner-up Paper Award.

20 July 2015, 1:30 pm, DC 1304

Title: Steps towards HW/SW-DB-CoDesign (PDF)

Speaker: Wolfgang Lehner, Technische Universität Dresden
Abstract: Modern hardware has a significant impact on database system architecture and algorithms, considering advances on the system level as well as on the level of individual components. Within the seminar talk, I will present research results of two specific directions to leverage opportunities coming along with modern HW. On the system level, I will describe the overall architecture of our main-memory centric storage engine strictly following the “Data ORiented Architecture” (DORA) paradigm to achieve high scalability as well as extreme elasticity on large NUMA systems. I will discuss the pros and cons as well as detailed design considerations. On the component level, I will provide insights into our Tomahawk4DB project, a custom “database processor” providing instruction set extensions supporting core database operations like RID-list intersection/union or hash-support for efficient join operations.
Bio: Wolfgang Lehner is a full professor and head of the database technology group at the TU Dresden, Germany. His research is dedicated to database system architecture, specifically looking at crosscutting aspects from algorithms down to hardware-related aspects in main-memory centric settings. He is part of TU Dresden's excellence cluster with research topics in energy-aware computing, resilient data structures on unreliable hardware, and orchestration of heterogeneous systems; he is also a principal investigator of Germany's national "Competence Center for Scalable Data Services and Solutions" (ScaDS); Wolfgang also maintains a close research relationship with the SAP HANA development team. He serves the community on many PCs, is an elected member of the VLDB Endowment, serves on the review board of the German Research Foundation (DFG), and is an appointed member of the Academy of Europe.