# Data Systems Seminar Series (2015-2016)

The Database Seminar Series provides a forum for presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks that are scheduled for this year are listed below.

The talks are usually held on a Monday at 10:30 am in room DC (Davis Centre) 1302. Exceptions are flagged.

We will try to post presentation notes whenever possible. Please click on the presentation title to access these notes.


## 5 October 2015, 10:30 am, M3 3127 (Please note special location)

Title: S-Store: A Streaming NewSQL System for Big Velocity Applications (PDF)

Speaker: Nesime Tatbul, Intel Labs and MIT

Abstract: Managing high-speed data streams in real time has become an integral part of today’s big data applications. In a significant portion of these applications, we see a critical need for real-time stream processing to co-exist with transactional state management due to the presence of shared mutable state. Yet, existing systems treat streaming and transaction processing as two separate computational paradigms, which makes it difficult to build such applications to execute correctly and scalably. S-Store is a new data management system that provides a single, scalable platform for processing streams and transactions. S-Store takes its architectural foundation from H-Store, a modern distributed main-memory OLTP ("NewSQL") system, and adds well-defined primitives to support data-driven processing such as streams, windows, triggers, and dataflow graphs. Furthermore, it makes a number of careful extensions to H-Store's traditional transaction model in order to maintain correctness guarantees in the presence of data and processing dependencies among transaction executions that involve streams. These guarantees include ACID, ordered execution, and exactly-once processing. In this talk, I will present S-Store's design and implementation, and show how S-Store can ensure transactional integrity without sacrificing performance.

Bio: Nesime Tatbul is a senior research scientist at the Intel Science and Technology Center for Big Data based at MIT CSAIL. Before joining Intel Labs, she was a faculty member at the Computer Science Department of ETH Zurich. She received her B.S. and M.S. degrees in Computer Engineering from the Middle East Technical University (METU), and her M.S. and Ph.D. degrees in Computer Science from Brown University. During her graduate school years at Brown, she also worked as a research intern at the IBM Almaden Research Center, and as a consultant for the U.S. Army Research Institute of Environmental Medicine (USARIEM). Her research interests are in database systems, with a recent focus on data stream processing and distributed data management. She is the recipient of an IBM Faculty Award in 2008, a Best System Demonstration Award at the ACM SIGMOD 2005 Conference, and both the Best Poster Award and the Grand Challenge Award at the ACM DEBS 2011 Conference. She has served on the program committee for various conferences including ACM SIGMOD (as an industrial program co-chair in 2014 and as a group leader in 2011), VLDB, and IEEE ICDE (as a PC track chair for Streams, Sensor Networks, and Complex Event Processing in 2013). She has chaired a number of VLDB co-located workshops including the International Workshop on Data Management for Sensor Networks (DMSN) and the International Workshop on Business Intelligence for the Real-Time Enterprise (BIRTE). Her recent editorial duties include PVLDB (associate editor, Volume 5, 2011-2012) and ACM SIGMOD Record (associate editor, Research Surveys Column, since June 2012).

## 26 October 2015, 10:30 am, DC 1304

Title: Key Innovations in MemSQL

Speaker: Ankur Goyal, MemSQL

Abstract: This talk will cover the major architectural design decisions with discussion on specific technical details as well as the motivation behind the big decisions. We will cover lock free skip lists, code generation, durability/replication, distributed query execution, and clustering in MemSQL. We will then discuss some of the new directions for the product and how these features position MemSQL uniquely in the market.

Bio: Ankur Goyal is the VP of Engineering at MemSQL. At MemSQL he has focused on distributed query execution and fault-tolerant clustering, but has touched most of the engine. His areas of interest are distributed systems, compilers, and operating systems. Ankur studied computer science at Carnegie Mellon University and worked on distributed data processing at Microsoft before MemSQL.

## 2 November 2015, 10:30 am, DC 1302

Title: I Don't Want to be the Mitt Romney of Databases (PDF)

Speaker: Andy Pavlo, Carnegie Mellon University

Abstract: What can I say? Yes, I helped build a database management system (DBMS) for the "one percent." This previous system (H-Store) is able to get up to 40x higher throughput over traditional, disk-oriented DBMSs for on-line transaction processing workloads. But getting this great performance requires a significant upfront deployment cost (e.g., application rewriting, pre-partitioning). It is also unable to perform non-trivial analysis operations without the use of a separate data warehouse, which further increases costs and overhead. This makes a DBMS like ours accessible to only those organizations with ample resources. In this talk, I outline our vision for a new distributed DBMS (codenamed "Peloton") that we are building at CMU that is truly for the 99%. It will enable any application to get the same kind of performance as a specialized system like H-Store without any expensive setup or maintenance. The crux of the system is to employ machine learning techniques to support the efficient execution of hybrid workloads (transactions + analytics) through intelligent pre-fetching and automatic partitioning/tuning. In essence, our new DBMS is able to learn about how an application uses the database without any human intervention and reconfigure itself accordingly.

Bio: Andy Pavlo is an Assistant Professor of Databaseology in the Computer Science Department at Carnegie Mellon University.

## 9 November 2015, 10:30 am, DC 1302

Title: Efficient Location-aware Web Search (PDF)

Speaker: Shane Culpepper, RMIT

Abstract: Mobile search is quickly becoming the most common mode of search on the internet. This shift is driving changes in user behaviour, and search engine behaviour. Just over half of all search queries from mobile devices have local intent, making location-aware search an increasingly important problem. In this work, we explore the efficiency and effectiveness of two general types of geographical search queries, range queries and $k$ nearest neighbor queries, for common web search tasks. We test state-of-the-art spatial-textual indexing and search algorithms for both query types on two large datasets. Finally, we present a rank-safe dynamic pruning algorithm that is simple to implement and use with current inverted indexing techniques. Our algorithm is more efficient than the tightly coupled best-in-breed hybrid indexing algorithms that are commonly used for top-$k$ spatial textual queries, and more likely to find relevant documents than techniques derived from range queries.

Bio: Shane Culpepper is an ARC DECRA Research Fellow and Senior Lecturer at RMIT University in Melbourne, Australia. He completed a PhD in Computer Science at The University of Melbourne in 2008 under the supervision of Alistair Moffat. His research focuses primarily on designing efficient and scalable algorithms for a wide variety of information storage and retrieval problems. Research interests include information retrieval, text indexing, data compression, experimental algorithmics, and natural language processing. For more information, visit his homepage at http://www.culpepper.io.

## 11 January 2016, 2:00 pm, DC 1302 (Please note unusual time)

Title: Research in Information Retrieval and Machine Learning at Oracle Labs

Speaker: Stephen Green, Oracle Labs

Abstract: This talk will describe current and past research in Information Retrieval and Machine Learning at Oracle Labs. Along the way we will talk about research at Oracle Labs in general, about work on scalable machine learning, feature selection, and sentiment analysis, and about what it is like to do research in an industrial setting.

Bio: Stephen Green is a Consulting Member of Technical Staff at Oracle Labs in Burlington, Massachusetts, where he is the Principal Investigator of the Information Retrieval and Machine Learning project. He is the chief architect and implementer of the Minion Search Engine, a high-performance, open source Java search engine incorporating techniques from information retrieval, natural language processing, and knowledge representation.

## 11 April 2016, 10:30 am, DC 1302

Title: Next-generation Data-parallel Dataflow Systems (PDF)

Speaker: Frank McSherry

Abstract: The Naiad project at Microsoft Research introduced a new model of dataflow computation, timely dataflow, which was designed to support low-latency computation in data-parallel dataflow graphs containing structured cycles. This model substantially enlarged the space of data-parallel computations that can be reasonably expressed, as compared to other modern “big data” systems. Naiad achieved excellent performance in its intended application domains, largely by providing the dataflow operators with meaningful and low-overhead coordination primitives, but otherwise staying out of their way. In this talk we will discuss performance issues with existing systems, review timely dataflow, and present a new data-parallel design that coordinates less frequently yet more accurately. The design is implemented in Rust and is available at https://github.com/frankmcsherry/timely-dataflow, and currently out-performs several popular distributed systems even when run on the speaker's laptop. This talk reflects work done jointly with Derek Murray, Rebecca Isaacs, Michael Isard, Paul Barham, and Martin Abadi.

Bio: Frank McSherry is an independent researcher formerly affiliated with Microsoft Research, Silicon Valley. While there he led the Naiad project, which introduced both differential and timely dataflow, and remains one of the top-performing big data platforms. He also works with differential privacy, due in part to its interesting relationship to data-parallel computation. Frank currently enjoys spending his time in places other than Silicon Valley.

## 18 April 2016, 2:30 pm, DC 1302

Title: Data and Algorithmic Bias in the Web

Speaker: Ricardo Baeza-Yates, UPF, Spain & UChile

Abstract: The Web is the largest public big data repository that humankind has created. In this overwhelming data ocean we need to be aware of the quality and, in particular, of the biases that exist in this data, such as redundancy, spam, etc. These biases affect the algorithms that we design to improve the user experience. This problem is further exacerbated by biases that are added by these algorithms, especially in the context of search and recommendation systems. They include ranking bias, presentation bias, position bias, etc. We give several examples and their relation to sparsity, novelty, and privacy, stressing the importance of the user context to avoid these biases.

Bio: Ricardo Baeza-Yates's areas of expertise are information retrieval, web search and data mining, data science and algorithms. He was VP of Research at Yahoo Labs, based in Barcelona, Spain, and later in Sunnyvale, California, from January 2006 to February 2016. He is a part-time Professor at DTIC of the Universitat Pompeu Fabra, in Barcelona, Spain. Until 2004 he was Professor and founding director of the Center for Web Research at the Dept. of Computing Science of the University of Chile. He obtained a Ph.D. in CS from the University of Waterloo, Canada, in 1989. He is co-author of the best-selling textbook Modern Information Retrieval, published by Addison-Wesley in 2011 (2nd ed), which won the ASIST 2012 Book of the Year award. From 2002 to 2004 he was elected to the board of governors of the IEEE Computer Society and in 2012 he was elected to the ACM Council. Since 2010 he has been a founding member of the Chilean Academy of Engineering. In 2009 he was named ACM Fellow and in 2011 IEEE Fellow, among other awards and distinctions.

## 9 May 2016, 10:30 am, DC 1302

Title: Watson Content Services: Creation, Maintenance and Consumption of Knowledge Bases

Speaker: Shivakumar Vaithyanathan, IBM Almaden Research Center

Abstract: In this talk I will describe a scalable ontology-driven infrastructure for the creation, maintenance and consumption of knowledge bases from multiple (un/semi)structured data sources. The purpose of this infrastructure is to support the next generation of applications based on insights derived from public data, licensed data, and data sometimes referred to as dark (primarily for effect). I will describe the design and current status of the Content Services platform for (a) scalable infrastructure for continuous large-scale analysis of multiple (un/semi)structured sources to create an integrated view of entities, relationships and events, including support for incremental processing, flow automation, monitoring, failure recovery and versioning, and (b) flexible knowledge representation and querying over the structured representation of enriched data. At appropriate points in the talk I will discuss research challenges, briefly describing the current work in IBM Research to address these challenges.

Bio: Shivakumar Vaithyanathan is an IBM Fellow and Director, Watson Content Services. Prior to that he started and managed the Analytics Department at IBM Almaden. Multiple technologies developed under his direction ship with several IBM products and have also been released as open source. He has co-authored more than 40 papers in major conferences including ACL, EMNLP, SIGMOD, VLDB, ICML, NIPS and UAI.

## 16 May 2016, 10:30 am, DC 1302

Title: Connecting Searching with Learning (PDF)

Speaker: Kevyn Collins-Thompson, University of Michigan

Abstract: While search engines are widely used to find educational material, current search technology is optimized to provide information of generic relevance, not results that are oriented toward a user's learning goals. As a result, users often do not get effective access to the materials best suited for their specific learning needs. Moreover, little is known about the relationship between search interaction over time and actual learning outcomes. With collaborators, I have been exploring new content representations, implicit assessment methods, interaction features, and retrieval algorithms for search engines toward better understanding and support of human learning, broadly defined. This talk will summarize progress from recent projects oriented toward that goal, including a study of search ranking algorithms that incorporate learning-related features such as reading difficulty and concept density, and user studies exploring the relationship between search interaction patterns and learning outcomes.

Bio: Kevyn Collins-Thompson is an Associate Professor of Information and Computer Science at the University of Michigan. His research explores algorithms and software systems for optimally connecting people with information, especially toward educational goals. His research on personalization has been applied to real-world systems ranging from intelligent tutoring systems to Web search engines. Kevyn has also pioneered techniques for modeling the reading difficulty of text, and understanding and supporting how people learn language. He received his Ph.D. from the School of Computer Science at Carnegie Mellon University and his B.Math in Computer Science from the University of Waterloo. Before joining the University of Michigan in 2013 he was a researcher in the Context, Learning, and User Experience for Search (CLUES) group at Microsoft Research.

## 7 July 2016, 2:00 pm, DC 2585 (Please note unusual room and time)

Title: Using Replicates in Information Retrieval Evaluation (PDF)

Speaker: Ellen Voorhees, National Institute of Standards and Technology

Abstract: The goal of the test collection methodology used in information retrieval research is to be able to draw reliable conclusions regarding the role that differences in retrieval systems play in the differences in retrieval evaluation scores, i.e., to measure the system effect. The difficulty is that evaluation scores vary for more reasons than simply the differences in retrieval systems, including the particular information need (or topic) to be satisfied, whose effect is frequently larger than the system effect. This work explores a method for more accurately estimating the main effect of the system in a typical test-collection-based evaluation, thereby increasing the sensitivity of system comparisons. Randomly partitioning the test document collection allows for multiple tests of a given system and topic (replicates). Bootstrap ANOVA can use these replicates to extract system-topic interactions, something not possible without replicates, yielding a more precise value for the system effect and a narrower confidence interval around that value. Experiments using multiple TREC collections demonstrate that removing the topic-system interactions substantially reduces the confidence intervals around the system effect as well as increases the number of significant pairwise differences found. Further, the method is robust against small changes in the number of partitions used, against variability in the documents that constitute the partitions, and against the measure of effectiveness used to quantify system effectiveness.

Bio: Ellen Voorhees is a computer scientist at the U.S. National Institute of Standards and Technology where her primary responsibility is directing the Text REtrieval Conference (TREC) project. Her research focuses on developing and validating appropriate evaluation schemes to measure system effectiveness for diverse user search tasks. Voorhees received her PhD in computer science from Cornell University and was granted three patents on her work on information access while a member of the technical staff at Siemens Corporate Research.

## 12 July 2016, 2:00 pm, DC 1304 (Please note unusual room and time)

Title: ML for IR: Sentiment Analysis and Multi-label Categorization (PDF)

Speaker: Jay Aslam, Northeastern University

Abstract: We consider two problems in information retrieval, sentiment analysis and multi-label categorization, and we explore the use of machine learning techniques to solve each of these problems. In sentiment analysis, we demonstrate the utility of skip-gram features and the use of L1 and L2 regularization within machine learning in order to effectively accomplish feature selection and predictive generalization. In multi-label categorization, where one must assign an object such as a text document to an appropriate subset of possible labels, we introduce a new technique based on conditional Bernoulli mixtures and demonstrate its utility on a number of benchmark data sets.

Bio: Jay Aslam is a Professor and Associate Dean of Faculty in the College of Computer and Information Science at Northeastern University. Prior to joining Northeastern University, he was on the faculty at Dartmouth College. Prof. Aslam obtained his PhD in Computer Science from MIT, and he held a postdoctoral position at Harvard University. Prof. Aslam's research interests include information retrieval, machine learning, and the design and analysis of algorithms. In machine learning, he has developed models and algorithms for learning in the presence of noisy or erroneous training data, and he has explored the use of machine learning to solve problems in transportation, computer security, wireless networking, human computation, and medical informatics. In information retrieval, he has applied techniques from machine learning, statistics, information theory, and social choice theory to develop algorithms for efficient search engine training and evaluation, metasearch, automatic information organization, and learning-to-rank. Prof. Aslam served as the General co-Chair for the 2009 ACM SIGIR Conference on Research and Development in Information Retrieval, and he currently serves as the Program co-Chair for SIGIR 2016.