# Database Seminar Series (2013-2014)

The Database Seminar Series provides a forum for presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks scheduled for 2013-2014 are listed below.

The talks are usually held on a Wednesday at 2:30 pm in room DC (Davis Centre) 1302.

We will try to post the presentation notes whenever possible. Please click on the presentation title to access these notes.

The Database Seminar Series is supported by Sybase, an SAP Company.

Speakers: Davood Rafiei, Umeshwar Dayal, Tim Kraska, Mark Fox, Paul Larson, Fei Chiang, Timothy Roscoe, Alexandros Labrinidis, Bettina Kemme, Radu Sion

## 5 September 2013, 2:30 pm, DC 1304

Title: Efficiently querying natural language text (PDF)

Speaker: Davood Rafiei, University of Alberta

Abstract: Unstructured text has long coexisted with relational data but has rarely been treated as a first-class citizen inside a relational database. Given that a large portion of the text produced every day in the public domain is primarily for human consumption, the data is often available in some form of natural language text. Searching such sources at granularities smaller than a document has been a challenge; yet querying for more factual information and relationships, or relating the encoded facts to more structured data, provides many clear advantages. In this talk, we give an overview of the work on natural language text as it pertains to querying. We showcase some of our work on the subject, including some of the building blocks we have been working on for querying, indexing and resolving queries. We conclude by presenting some of the challenges and open directions.

Bio: Davood Rafiei did his undergraduate work at Sharif, his M.Sc. in Waterloo and his PhD in Toronto. He joined the University of Alberta in 2000, where he is now Associate Professor of Computer Science and a member of the Database Systems Research Group. Davood has served on the program committees of both database conferences such as SIGMOD and VLDB and Web conferences such as WWW. His areas of interest span databases and the Web and include integrating natural language text with relational data, Web information retrieval, and similarity queries and indexing. Davood was a visiting scientist at Google (Mountain View) for a year in 2007-2008.

## 16 October 2013, 2:30 pm, DC 1302

Title: MDCC and PLANET: A 1-Phase Commit Protocol and Programming Model for Geo-Replicated Applications

Speaker: Tim Kraska, Brown University

Abstract: Replicating data across multiple data centers not only allows moving the data closer to the user, thus reducing latency for applications, but also increases availability in the event of a data-center failure. It is therefore not surprising that companies like Google, Yahoo, and Netflix already replicate user data across geographically different regions. Replication across data centers, however, is expensive. Inter-data center network delays are in the hundreds of milliseconds and vary significantly. Synchronized wide-area replication is therefore often considered to be infeasible with strong consistency, and current solutions either settle for asynchronous replication, which implies the risk of losing data in the event of failures, restrict consistency to very small partitions, or give up consistency entirely. In this talk, I will first describe MDCC (Multi-Data Center Consistency), an optimistic 1-Phase Commit Protocol that does not require a master or static partitioning and can be used to achieve strongly consistent replication at a cost similar to eventually consistent protocols. Afterwards, I will present our Predictive Latency-Aware NEtworked Transactions (PLANET), a new transaction programming model which empowers the application developer to cope with longer and unpredictable latencies caused by inter-data center communication.

Bio: Tim Kraska is an Assistant Professor in the Computer Science department at Brown University. Currently, his research focuses on Big Data management in the cloud and hybrid human/machine database systems. Before joining Brown, Tim Kraska spent 2 years as a PostDoc in the AMPLab at UC Berkeley after receiving his PhD from ETH Zurich, where he worked on transaction management and stream processing. He was awarded a Swiss National Science Foundation Prospective Researcher Fellowship (2010), a DAAD Scholarship (2006), a University of Sydney Master of Information Technology Scholarship for outstanding achievement (2005), the University of Sydney Siemens Prize (2005), a VLDB best demo award (2011) and an ICDE best paper award (2013).

## 30 October 2013, 2:30 pm, DC 1302

Title: An Ontology for Global City Indicators (PDF)

Speaker: Mark Fox, University of Toronto

Abstract: Cities are moving towards policy-making based on data. Today there are thousands of different sets of city performance indicators and hundreds of agencies compiling and reviewing them. However, these indicators are usually not standardized, consistent or comparable (over time or across cities). In response to this challenge, the Global City Indicator Facility was created by the World Bank at the University of Toronto, to define a set of city indicators that can be consistently applied globally. Over 250 cities worldwide are participating in this effort. This seminar describes the effort to create an ontology for city indicators in particular, and systems metrics in general. The ontology integrates over 10 ontologies from across the semantic web, including geonames, measurement theory, statistics, time, provenance, validity and trust. It extends these ontologies, where appropriate, to satisfy the ontology’s competency requirements. The ontology is defined in OWL, and implemented in a Prolog RDF server. In addition, a set of consistency axioms is defined and implemented to perform tests not possible using the OWL axiomatization.

Bio: Dr. Fox received his BSc in Computer Science from the University of Toronto in 1975 and his PhD in Computer Science from Carnegie Mellon University in 1983. In 1979 he was a founding member of the Robotics Institute of Carnegie Mellon University as well as the founding Director of the Intelligent Systems Laboratory within the Institute. He co-founded Carnegie Group Inc. in 1984, a software company that specialized in knowledge-based systems for solving engineering, manufacturing, and telecommunications problems, and was its Vice-President of Engineering and President/CEO. Carnegie Mellon University appointed him Associate Professor of Computer Science and Robotics in 1987 (with tenure in 1991). In 1988 he was the founding Director of the Center for Integrated Manufacturing Decision Systems at Carnegie Mellon. In 1991, Dr. Fox returned to the University of Toronto where he was appointed the NSERC Research Chairholder in Enterprise Integration and Professor of Industrial Engineering and Computer Science. In 1992, he was appointed Director of the Collaborative Program in Integrated Manufacturing. In 1993, Dr. Fox co-founded and was CEO of Novator Systems Ltd., a pioneer in E-Retail software and services. Dr. Fox's research led to the creation of the field of Constraint-Directed Scheduling within Artificial Intelligence, and several commercially successful scheduling systems and companies. He also pioneered the application of Artificial Intelligence to project management, simulation, and material design. He was the designer of one of the first commercial industrial applications of expert systems: PDS/GENAID, a steam turbine and generator diagnostic system for Westinghouse, which was a recipient of the IR100 in 1985 and is still in commercial use at Siemens. He was the co-creator of the Knowledge Representation SRL from which Knowledge Craft™ and ROCK™, commercial knowledge engineering tools, were derived, and KBS from which several commercial knowledge based simulation tools were derived. His current research focuses on ontologies, common sense reasoning and their application to Smart Cities. Dr. Fox was elected a Fellow of the Association for the Advancement of Artificial Intelligence in 1991, a Joint Fellow of the Canadian Institute for Advanced Research and PRECARN in 1992, and a Fellow of the Engineering Institute of Canada in 2009. He is a past AAAI councillor, and a member of ACM and IEEE. Dr. Fox has published over 150 papers.

## 20 November 2013, 2:30 pm, DC 1302

Title: Evolving the Architecture of SQL Server (PDF)

Speaker: Paul Larson, Microsoft Research

Abstract: The major commercial database systems were designed primarily for OLTP workloads and under the assumption that processors are slow, memory is scarce, and data lives on disk. These assumptions are no longer valid: OLAP workloads are now as common as OLTP workloads, multi-core processors are the norm, large memories are affordable, and frequently accessed data lives in the main memory buffer pool. So how can a vendor with a mature database system exploit the opportunities offered by these changes? The only realistic option is to gradually evolve the architecture of the system. SQL Server has begun this journey by adding two features: column store indexes to speed up OLAP-type queries and Hekaton, a new OLTP engine optimized for large memories and multicore processors. The talk will outline the design of these features, the main goals and constraints, and discuss the reasoning behind the design choices made.

Bio: Paul (Per-Åke) Larson has conducted research in the database field for over 30 years. He was a Professor in the Department of Computer Science at the University of Waterloo for 15 years and joined Microsoft Research in 1996 where he is a Principal Researcher. Paul has worked in a variety of areas: file structures, materialized views, query processing, and query optimization among others. During the last few years he has collaborated closely with the SQL Server team on how to evolve the architecture of the core database engine for modern hardware.

## 22 January 2014, 2:30 pm, DC 1302

Title: Continuous Data Cleaning

Speaker: Fei Chiang, McMaster University

Abstract: In declarative data cleaning, data semantics are encoded as constraints and errors arise when the data violates the constraints. Various forms of statistical and logical inference can be used to reason about and repair inconsistencies (errors) in the data. Recently, unified approaches that repair both errors in data and errors in semantics (the constraints) have been proposed. However, both data-only approaches and unified approaches are by and large static in that they apply cleaning to a single snapshot of the data and constraints. In this talk, I will present a continuous data cleaning framework that can be applied to dynamic data and constraint environments. Our approach permits both the data and its semantics to evolve and suggests repairs based on the accumulated evidence to date. Importantly, our approach uses not only the data and constraints as evidence, but also considers the past repairs chosen and applied by a user (user repair preferences). I will then describe details of a repair classifier that predicts the type of repair needed to resolve an inconsistency and learns from past user repair preferences to recommend more accurate repairs in the future.

Bio: Fei Chiang is an Assistant Professor in the Department of Computing and Software at McMaster University. Her research interests are broadly in the area of data management, with a focus on data quality, data cleaning, data privacy, and information extraction. She received her M.Math from the University of Waterloo, and B.Sc. and PhD degrees from the University of Toronto, all in Computer Science. She has worked at IBM Global Services, in the Autonomic Computing Group at the IBM Toronto Lab, and in the Data Management, Exploration and Mining Group at Microsoft Research.

## 6 May 2014, 11:00 am, DC 1304 (Joint with Systems Group; note special time and place)

Title: Treating cores as devices

Speaker: Timothy Roscoe, ETH Zürich

Abstract: Power management, dark silicon, and partial failures mean that, in the future, computer hardware will most likely consist of a dynamically changing set of heterogeneous processor cores. Contemporary operating system structures were not designed with this hardware model in mind, and have difficulty adapting to relatively simple concepts such as processor hotplug. Our work on meeting this challenge in the Barrelfish research OS has led us to treat cores as much as possible (but not entirely) like any other devices in the system. Several novel ideas make this possible: aside from the multikernel architecture itself, we leverage the externalization of kernel state through capabilities, and the concept of a "boot driver", which is the equivalent of a device driver for a processor core. In this talk I will present our framework for managing a changing set of cores in a multikernel OS, and some of the surprising consequences: individual kernels can be rebooted, replaced, or upgraded on the fly, cores and hardware threads can be temporarily turned into coprocessors and back again, and per-core OS state can be quickly moved around the hardware to minimize energy usage or enforce performance guarantees.

Bio: Timothy Roscoe is a Professor in the Systems Group of the Computer Science Department at ETH Zurich. He received a PhD from the Computer Laboratory of the University of Cambridge, where he was a principal designer and builder of the Nemesis operating system, as well as working on the Wanda microkernel and Pandora multimedia system. After three years building web-based collaboration systems at a startup company in North Carolina, Mothy joined Sprint's Advanced Technology Lab in Burlingame, California, working on application hosting platforms and networking monitoring. Mothy joined Intel Research at Berkeley in April 2002 as a principal architect of PlanetLab, an open, shared platform for developing and deploying planetary-scale services. In September 2006 he spent four months as a visiting researcher in the Embedded and Real-Time Operating Systems group at National ICT Australia in Sydney, before joining ETH Zurich in January 2007. He is a recipient of a 2013 ACM SIGCOMM 10-year test-of-time award, and a 2014 Usenix NSDI 10-year test-of-time award, both for his work on PlanetLab. In 2014 he was made an ACM Fellow for his contributions to operating systems and networking research. His current research interests include operating systems for heterogeneous multicore systems, and network architecture.

## 21 May 2014, 2:30 pm, DC 1302

Title: Handling Big Streaming Data with DILoS (PDF)

Speaker: Alexandros Labrinidis, University of Pittsburgh

Abstract: For the past few years, our group has been working on problems related to Big Data through several projects. After briefly discussing these projects, the rest of this talk will present DILoS, which focuses on load management for Big Streaming Data. Today, the ubiquity of sensing devices as well as of mobile and web applications continuously generates a huge amount of data in the form of streams, which need to be continuously processed and analyzed to meet the near-real-time requirements of monitoring applications. Such processing happens inside data stream management systems (DSMSs), which efficiently support continuous queries (CQs). CQs inherently have different levels of criticality and hence different levels of expected quality of service (QoS) and quality of data (QoD). In order to provide different quality guarantees, i.e., service level agreements (SLAs), to different client stream applications, we developed DILoS, a novel framework that exploits the synergy between scheduling and load shedding in DSMSs. In overload situations, DILoS enforces worst-case response times for all CQs while providing prioritized QoD, i.e., minimizing data loss for query classes according to their priorities. We further propose ALoMa, a new adaptive load manager scheme that enables the realization of the DILoS framework. ALoMa is a general, practical DSMS load shedder that outperforms the state-of-the-art in deciding when the DSMS is overloaded and how much load needs to be shed. We implemented DILoS in our real DSMS prototype system (AQSIOS) and evaluated its performance for a variety of real and synthetic workloads. Our experiments show that our framework (1) allows the scheduler and load shedder to consistently honor CQs' priorities and (2) maximizes the utilization of the system processing capacity to reduce load shedding.

Bio: Dr. Alexandros Labrinidis received his Ph.D. degree in Computer Science from the University of Maryland, College Park in 2002. He is currently an associate professor at the Department of Computer Science of the University of Pittsburgh and co-director of the Advanced Data Management Technologies Laboratory (ADMT Lab). He is also an adjunct associate professor at Carnegie Mellon University (CS Dept). Dr. Labrinidis' research focuses on user-centric data management for scalable network-centric applications, including web-databases, data stream management systems, sensor networks, and scientific data management (with an emphasis on big data). He has published over 70 papers in peer-reviewed journals, conferences, and workshops, and he received an NSF CAREER award in 2008. Dr. Labrinidis served as the Secretary/Treasurer for ACM SIGMOD and as the Editor of SIGMOD Record. He is currently on the editorial board of the Parallel and Distributed Databases Journal. He has also served on numerous program committees of international conferences/workshops; in 2014, he was the PC Track-Chair for Streams and Sensor Networks for the ICDE conference.

DILoS was developed in collaboration with Thao N. Pham (as part of her PhD thesis) and Panos K. Chrysanthis, who is the director of the ADMT lab. This work has been funded in part by two NSF Awards and a gift from EMC/Greenplum.

## 22 May 2014, 2:30 pm, DC 1331

Title: Multiplayer Games: the perfect application to explore scalable and secure distributed replica management

Speaker: Bettina Kemme, McGill University

Abstract: Multiplayer Online Games (MOGs) are an extremely popular online technology, one that produces billions of dollars in revenues. The underlying architecture of game engines is distributed by nature and has to maintain large amounts of quickly changing state. In particular, each client has its own partial view of a continuously evolving virtual world, and all these client copies have to be kept up-to-date. In this talk, I will present an overview of current game architectures, from client-server to peer-to-peer architectures, and outline possible solutions to several challenges that one faces when trying to meet the scalability, response time and low cost requirements of multiplayer game engines: distributed state maintenance, scalable update dissemination, and the avoidance or detection of malicious cheating behaviour.

Bio: Bettina Kemme is an Associate Professor at the School of Computer Science of McGill University, Montreal, where she leads the distributed information systems lab. She holds a PhD degree in Computer Science from ETH Zurich, and an undergraduate degree from the Friedrich-Alexander-Universitaet Erlangen, Germany. Bettina has published over 70 papers in major journals and conferences in the areas of database systems and distributed systems, and has served on the program committees and as area chair of major database and distributed systems conferences such as SIGMOD, VLDB, ICDE, ICDCS, Middleware and many more. Her research focuses on large-scale distributed data management with a main focus on data consistency and data dissemination.

## 18 June 2014, 2:30 pm, DC 1302

Title: Modern Secure Data Management

Speaker: Radu Sion, Stony Brook University

Abstract: Digital societies and markets increasingly mandate consistent procedures for the access, processing and storage of information. In the United States alone, over 10,000 such regulations can be found in financial, life sciences, health care and government sectors, including the Gramm-Leach-Bliley Act, the Health Insurance Portability and Accountability Act, the Sarbanes-Oxley Act, etc. A recurrent theme in these regulations is the need for regulatory compliant storage as an underpinning to ensure data confidentiality, access integrity and authentication; provide audit trails, guaranteed deletion, and data migration. However, without the availability of practical, technology-backed enforcement solutions, full regulatory compliance cannot be realized. In this work we posit that the seemingly contradictory requirements of security, efficiency and low cost can in fact be reconciled gracefully via intelligent deployment of cryptographic and system security constructs. To this end we design and prototype a number of fully functional relational database and file systems, addressing data privacy, query authentication and data retention, while offering increased functionality, higher efficiency and lower costs.

Bio: Radu is an Associate Professor of Computer Science at Stony Brook University (on leave) and currently the CEO of Private Machines Inc. He remembers when gophers were digging through the Internets and bits were running at slower paces of 512 per second. He is also interested in efficient computing with a touch of cyber-security paranoia, raising rabbits on space ships and sailing catamarans of the Hobie variety.