Data Systems Seminar Series (2017-2018)

The Data Systems Seminar Series provides a forum for presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks that are scheduled for this year are listed below.

The talks are usually held on a Monday at 10:30 am in room DC 1302. Exceptions are flagged.

We will try to post the presentation notes whenever possible. Please click on a presentation title to access them.




Benny Kimelfeld
Heng Ji
Aditya Parameswaran
Rumi Chunara
Lei Zou
Barzan Mozafari
Rachel Pottinger
Daniel Lemire
Panos Ipeirotis
Torben Bach Pedersen
Paolo Atzeni

5 September 2017, 2:30 pm, DC 1331 (Please note special time and place)

Title: Enumerating Tree Decompositions: Why and How
Speaker: Benny Kimelfeld, Technion
Abstract:

Many intractable problems on graphs have efficient solvers when the graphs are trees or forests. Tree decompositions often allow such solvers to be applied to general graphs by grouping nodes into bags laid out in a tree structure, thereby decomposing the problem into the sub-problems induced by the bags. This approach has applications in a plethora of domains, partly because it allows optimizing inference on probabilistic graphical models, as well as the evaluation of database queries.

Nevertheless, a graph can have exponentially many tree decompositions, and finding an ideal one is challenging for two main reasons. First, the measure of goodness often depends on subtleties of the specific application at hand. Second, theoretical hardness arises already for the simplest measures, such as the maximal size of a bag (a.k.a. “width”). Therefore, we explore the approach of producing a large space of high-quality tree decompositions for the application to choose from.
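To make the quality measure concrete, here is a minimal Java sketch (our own illustration, not material from the talk) that computes the width of a given tree decomposition, represented simply as a list of bags; the tree layout itself does not affect the width:

import java.util.*;

public class Width {
    // Width is conventionally the largest bag size minus one,
    // so that a tree itself has width 1.
    static int width(List<Set<Integer>> bags) {
        int maxBag = 0;
        for (Set<Integer> bag : bags) maxBag = Math.max(maxBag, bag.size());
        return maxBag - 1;
    }

    public static void main(String[] args) {
        // The 4-cycle 1-2-3-4-1 admits a decomposition with bags
        // {1,2,3} and {1,3,4} joined by a tree edge: width 2.
        List<Set<Integer>> bags = List.of(Set.of(1, 2, 3), Set.of(1, 3, 4));
        System.out.println(width(bags)); // prints 2
    }
}

Enumerating tree decompositions means generating many such valid bag layouts for the same graph and ranking them by measures like this one.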

I will describe our application of tree decompositions in the context of “worst-case optimal” joins — a new breed of in-memory join algorithms that satisfy strong theoretical guarantees and were found to feature a significant speedup compared to traditional approaches. Specifically, I will explain how this development led us to the challenge of enumerating tree decompositions. Then, I will describe a novel enumeration algorithm for tree decompositions with a theoretical guarantee on the delay (the time between consecutive answers), and an experimental study thereof (on graphs from various relevant domains). Finally, I will describe recent results that provide guarantees on both the delay and the quality of the generated tree decompositions.

The talk will be based on papers that appeared in EDBT 2017 and PODS 2017, co-authored with Nofar Carmeli, Yoav Etsion, Oren Kalinsky and Batya Kenig.
Bio:

Benny Kimelfeld is an Associate Professor at Technion, Israel. In the past he has been at LogicBlox and at IBM Research – Almaden. His research interests span aspects of data management, such as database theory and systems, algorithms for query evaluation, information extraction, information retrieval, data mining, and database uncertainty.

He received his Ph.D. in Computer Science from The Hebrew University of Jerusalem, under the supervision of Prof. Yehoshua Sagiv.

16 October 2017, 10:30 am, DC 1304 (Please note unusual room)

Title: Universal Information Extraction
Speaker: Heng Ji, Rensselaer Polytechnic Institute
Abstract:

The goal of Information Extraction (IE) is to extract structured facts from a wide spectrum of heterogeneous unstructured data types including texts, speech, images and videos. Traditional IE techniques are limited to a certain source X (X = a particular language, domain, limited number of pre-defined fact types, single data modality...). When we move from X to a new source Y, we need to start from scratch again by annotating a substantial amount of training data and developing Y specific extraction capabilities.

We propose a new Universal Information Extraction (IE) paradigm that combines the merits of traditional IE (high quality and fine granularity) and Open IE (high scalability). This framework aims to discover schemas and extract facts from any input corpus, without any annotated training data or predefined schema. It can also be extended to multiple data modalities (images, videos) and 282 languages by constructing a common semantic space and applying transfer learning across sources.

Bio:

Heng Ji is the Edward P. Hamilton Development Chair Professor in the Computer Science Department at Rensselaer Polytechnic Institute. She received her Ph.D. in Computer Science from New York University. Her research interests focus on Natural Language Processing, especially Information Extraction and Knowledge Base Population.

She was selected as a "Young Scientist" and a member of the Global Future Council on the Future of Computing by the World Economic Forum in 2016 and 2017. She received the "AI's 10 to Watch" Award from IEEE Intelligent Systems in 2013, an NSF CAREER award in 2009, Google Research Awards in 2009 and 2014, IBM Watson Faculty Awards in 2012 and 2014, and Bosch Research Awards in 2015 and 2016. She has coordinated the NIST TAC Knowledge Base Population task since 2010 and is now serving as Program Committee Co-Chair of NAACL 2018.

2 November 2017, 10:30 am, DC 1302 (Please note unusual day)

Title: Enabling Data Science for the 99%

Speaker: Aditya Parameswaran, University of Illinois at Urbana-Champaign
Abstract:

There is a severe lack of interactive tools to help people manage, analyze, and make sense of large datasets. This talk will briefly cover three tools under development in our research group (with collaborators at Illinois, MIT, Maryland, and Chicago) that empower individuals and teams to perform interactive data analysis more effectively. The three tools span the spectrum of analysis types — from browsing with DataSpread, a spreadsheet-database hybrid, to exploration with ZenVisage, an effortless visualization recommendation tool, and finally to analysis and collaboration with Orpheus, a database system that supports versioning as a first-class citizen.

Bio:

Aditya Parameswaran is an Assistant Professor in Computer Science at the University of Illinois (UIUC). He spent a year as a postdoc at MIT CSAIL following his PhD at Stanford University, before starting at Illinois in August 2014. He develops systems and algorithms for "human-in-the-loop" data analytics, synthesizing techniques from database systems, data mining, and human computation.

Aditya received the NSF CAREER Award, the TCDE Early Career Award, the C. W. Gear Junior Faculty Award from Illinois, multiple "best" Doctoral Dissertation Awards (from SIGMOD, SIGKDD, and Stanford), an "Excellent" Lecturer award from Illinois, a Google Faculty award, the Key Scientific Challenges award from Yahoo!, and multiple best-of-conference citations. He is an associate editor of SIGMOD Record and serves on the steering committee of the HILDA (Human-in-the-loop Data Analytics) Workshop. His research group is supported with funding from the NSF, the NIH, Adobe, the Siebel Energy Institute, and Google.

4 December 2017, 10:30 am, DC 1302

Title: Citizen-Sourced Data for Public Health Modeling
Speaker: Rumi Chunara, New York University
Abstract:

Knowledge generation through crowdsourcing is becoming increasingly possible and useful in many domains, yet it requires new method development given the observational, unstructured, and noisy nature of citizen-sourced data. In this talk I will discuss statistical and machine learning methods we are developing to integrate crowdsourced data into public health models. These include combining citizen-sourced and clinical data, accounting for biases, drawing inference from observational data, and generating relevant features. Examples will use empirical data from local and worldwide contexts.

Bio: Rumi Chunara is an Assistant Professor at New York University, jointly appointed in Computer Science and in Global Public Health. Her research interests combine data mining and machine learning with social and ubiquitous computing. Specifically, she focuses on feature extraction from, and statistical modeling of, unstructured and observational personally-generated data for epidemiological applications. She received her Ph.D. from MIT and was named an MIT Technology Review Innovator Under 35 in 2014.

20 April 2018, 10:30 am, DC 1304

Title: Speeding Up Set Intersections in Graph Algorithms using SIMD Instructions

Speaker: Lei Zou, Peking University
Abstract:

In this talk, I focus on accelerating a widely employed computing pattern, set intersection, to boost a group of relevant graph algorithms. A graph's adjacency lists can naturally be regarded as node sets, so set intersection is a primitive operation in many graph algorithms.

We propose QFilter, a set intersection algorithm using SIMD instructions. QFilter adopts a merge-based framework and compares two blocks of elements iteratively using SIMD instructions. The key insight behind our improvement is that we quickly filter out most unnecessary comparisons in one byte-checking step. We also present a binary representation called BSR that represents sets in a compact layout. By combining QFilter and BSR, we achieve data parallelism at two levels — inter-chunk and intra-chunk parallelism.
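As background, here is a scalar Java sketch of the merge-based intersection framework (our own illustration, not the speaker's SIMD code); QFilter replaces the element-at-a-time comparisons below with block comparisons in SIMD registers, preceded by the byte-checking filter:

import java.util.Arrays;

public class Intersect {
    // Merge-style intersection of two sorted adjacency lists.
    static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { out[k++] = a[i]; i++; j++; }
        }
        return Arrays.copyOf(out, k);
    }

    public static void main(String[] args) {
        // Common neighbours of two nodes, e.g. for triangle counting.
        int[] n1 = {1, 3, 5, 8}, n2 = {2, 3, 5, 9};
        System.out.println(Arrays.toString(intersect(n1, n2))); // [3, 5]
    }
}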

Furthermore, we find that node ordering impacts the performance of intersection in graph algorithms by affecting the compactness of BSR. We formulate the graph reordering problem as an optimization of the compactness of BSR and prove its strong NP-completeness. We then propose an approximate algorithm that can find a better ordering and improve performance by 39% on average. This work has been accepted to SIGMOD 2018.

Bio:

Lei Zou is a Professor in the Institute of Computer Science and Technology (ICST) of Peking University (PKU). He joined PKU in 2009 after receiving his BS degree and Ph.D. degree in Computer Science at Huazhong University of Science and Technology (HUST) in 2003 and 2009, respectively.

He received a CCF (China Computer Federation) Doctoral Dissertation Nomination Award in 2009, and won the Second Class Prize of the CCF Natural Science Award in 2014 and the Second Class Prize for Natural Science from the Ministry of Education, China, in 2017. During his PhD, Lei Zou visited the Hong Kong University of Science and Technology in 2007 and the University of Waterloo in 2008 as a visiting scholar.

His recent research interests include graph databases and knowledge graphs, particularly graph-based RDF data management. He has published more than 40 papers, including more than 30 in reputed journals and major international conferences, such as SIGMOD, VLDB, ICDE, AAAI, TODS, TKDE, and the VLDB Journal.

23 April 2018, 10:30 am, DC 1302

Title: Making Approximate Query Processing Mainstream: Progress and the Road Ahead

Speaker: Barzan Mozafari, University of Michigan
Abstract:

Approximate Query Processing (AQP) has been a subject of academic research for over 25 years now. However, until recently, it has had little success in terms of commercial adoption.

In this talk, we explain the interface and deployment barriers that have historically slowed the adoption of AQP by database vendors and enterprise users alike. We then discuss some of the recent advances that have successfully overcome some of these barriers. We also introduce several research directions and exciting opportunities that would not be possible in a database with precise answers.

In particular, we explore several opportunities at the intersection of statistics and data management, including our Database Learning vision — a database system that learns and becomes smarter over time — as well as novel abstractions for speeding up machine learning workloads through approximate operators and error-computation tradeoffs.
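For readers unfamiliar with AQP, the following toy Java sketch (our illustration, not any particular system's implementation) shows the core idea: answer an aggregate from a uniform sample and attach an error bound from the central limit theorem, rather than scanning every row:

import java.util.Random;

public class ApproxAvg {
    public static void main(String[] args) {
        // A synthetic "table" of 10 million numeric values.
        Random rnd = new Random(42);
        double[] table = new double[10_000_000];
        for (int i = 0; i < table.length; i++) table[i] = rnd.nextGaussian() * 10 + 50;

        int n = 10_000; // sample only 0.1% of the rows
        double sum = 0, sumSq = 0;
        for (int i = 0; i < n; i++) {
            double v = table[rnd.nextInt(table.length)];
            sum += v; sumSq += v * v;
        }
        double mean = sum / n;
        double var = (sumSq - n * mean * mean) / (n - 1);  // sample variance
        double ci95 = 1.96 * Math.sqrt(var / n);           // 95% interval half-width
        System.out.printf("AVG ~ %.3f +/- %.3f%n", mean, ci95);
    }
}

The interface barrier the abstract mentions is visible even here: the answer comes with a confidence interval, which traditional SQL clients and users are not accustomed to handling.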

Bio:

Barzan Mozafari is a Morris Wellman Assistant Professor of Computer Science and Engineering at the University of Michigan, Ann Arbor, where he leads a research group designing the next generation of scalable databases using advanced statistical models. Prior to that, he was a Postdoctoral Associate at MIT. He earned his Ph.D. in Computer Science from UCLA in 2011.

His research career has led to several open-source projects, including DBSeer (an automated database diagnosis tool), BlinkDB (a massively parallel approximate query engine), and SnappyData (an HTAP engine that empowers Apache Spark with transactions and real-time analytics).

He has won the National Science Foundation CAREER award, as well as several best paper awards in ACM SIGMOD and EuroSys. He is the founder of Michigan Software Experts, and a strategic advisor to SnappyData, a company that commercializes the ideas introduced by BlinkDB.

26 April 2018, 10:30 am, DC 1304

Title: Improving Understanding and Exploration of Data by Non-Database Experts
Speaker: Rachel Pottinger, University of British Columbia
Abstract:

Users are faced with an increasing onslaught of data, whether they are choosing movies to watch, assimilating data from multiple sources, or finding information relevant to their lives in open data registries.

In this talk I will discuss some of our recent and ongoing work on improving the understanding and exploration of such data, particularly by users with little database background.

Bio: Rachel Pottinger is an Associate Professor in Computer Science at the University of British Columbia.  She received her PhD in computer science from the University of Washington in 2004. Her main research interest is data management, particularly semantic data integration, how to manage metadata, how to manage data that is currently not well supported by databases, and how to make data easier to understand and explore.

10 May 2018, 2:00 pm, DC 1304

Title: Next Generation Indexes For Big Data Engineering
Speaker: Daniel Lemire, Université TÉLUQ
Abstract:

Maximizing performance in data engineering is a daunting challenge. We present some of our work on designing faster indexes, with a particular emphasis on compressed indexes. Our prior work includes (1) Roaring indexes, which are part of multiple big-data systems such as Spark, Hive, Druid, Atlas, Pinot, and Kylin, and (2) EWAH indexes, which are part of Git (GitHub) and included in major Linux distributions.
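As a taste of what such an index looks like in practice, here is a small Java sketch using the open-source org.roaringbitmap library (our example, not material from the talk):

import org.roaringbitmap.RoaringBitmap;

public class RoaringDemo {
    public static void main(String[] args) {
        // Two posting lists: document ids containing term A and term B.
        RoaringBitmap termA = RoaringBitmap.bitmapOf(1, 2, 3, 1000, 1_000_000);
        RoaringBitmap termB = RoaringBitmap.bitmapOf(3, 4, 1000);

        // Set operations run directly on the compressed representation.
        RoaringBitmap both = RoaringBitmap.and(termA, termB);
        System.out.println(both);                  // {3,1000}
        System.out.println(both.getCardinality()); // 2
    }
}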

We will present ongoing and future work on how we can process data faster while supporting the diverse systems found in the cloud (with upcoming ARM processors) and under multiple programming languages (e.g., Java, C++, Go, Python). We seek to minimize shared resources (e.g., RAM) while exploiting algorithms designed for the single-instruction-multiple-data (SIMD) instructions available on commodity processors. Our end goal is to process billions of records per second per core.

Bio:

Daniel Lemire is a computer science professor at the Université du Québec (TELUQ). He has also been a research officer at the National Research Council of Canada and an entrepreneur. He has written over 70 peer-reviewed publications, including more than 40 journal articles. He has held competitive research grants for the last 15 years. He serves on the program committees of leading computer science conferences (e.g., ACM CIKM, WWW, ACM WSDM, ACM SIGIR, ACM RecSys).

He programs in C, C++, Java, JavaScript, Python, Swift and Go. He works primarily in an open-source setting. You can find his software in Git, Apache Hive, Druid, Apache Kylin, Netflix Atlas, LinkedIn Pinot, Microsoft Visual Studio Team Services and so forth. Some of his compression software is used by Apache Arrow and Apache Impala. In 2012, he was recognized by the Google Open Source Peer Bonus Program.

He is a long-time social media user: his blog has thousands of readers and was featured on Slashdot, Reddit and Hacker News. He was one of the first Twitter users: @lemire.

28 June 2018, 10:30 am, DC 1302

Title: Targeted Crowdsourcing with a Billion (Potential) Users
Speaker: Panos Ipeirotis, NYU
Abstract:

We describe Quizz, a gamified crowdsourcing system that simultaneously assesses the knowledge of users and acquires new knowledge from them. Quizz operates by asking users to complete short quizzes on specific topics; as a user answers the quiz questions, Quizz estimates the user’s competence. To acquire new knowledge, Quizz also incorporates questions for which we do not have a known answer; the answers given by competent users provide useful signals for selecting the correct answers for these questions.

Quizz actively tries to identify knowledgeable users on the Internet by running advertising campaigns, effectively leveraging “for free” the targeting capabilities of existing, publicly available, ad placement services. Quizz quantifies the contributions of the users using information theory and sends feedback to the advertising system about each user. The feedback allows the ad targeting mechanism to further optimize ad placement.
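To illustrate the information-theoretic scoring, here is a Java sketch of a standard entropy-reduction formulation (the exact measure Quizz uses is detailed in the paper): a user of estimated competence q on n-choice questions reduces our uncertainty about each answer from log2(n) bits toward zero:

public class InfoGain {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // Bits of information per answer from a user who answers n-choice
    // questions correctly with probability q (assumes 0 < q < 1).
    // q = 1/n (random guessing) yields exactly zero information.
    static double gainPerAnswer(double q, int n) {
        return log2(n) + q * log2(q) + (1 - q) * log2((1 - q) / (n - 1));
    }

    public static void main(String[] args) {
        System.out.printf("random guesser (q=0.25): %.3f bits%n", gainPerAnswer(0.25, 4));
        System.out.printf("expert (q=0.90):         %.3f bits%n", gainPerAnswer(0.90, 4));
    }
}

Such per-user scores are what the system can feed back to the ad targeting mechanism to steer campaigns toward productive contributors.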

Our experiments, which involve over ten thousand users, confirm that we can crowdsource knowledge curation for niche and specialized topics, as the advertising network can automatically identify users with the desired expertise and interest in the given topic. We present controlled experiments that examine the effect of various incentive mechanisms, highlighting the need for short-term rewards as goals that incentivize users to contribute.

Finally, our cost-quality analysis indicates that the cost of our approach is below that of hiring workers through paid-crowdsourcing platforms, while offering the additional advantage of giving access to billions of potential users all over the planet, and being able to reach users with specialized expertise that is not typically available through existing labor marketplaces.

Bio:

Panos Ipeirotis is a Professor and George A. Kellner Faculty Fellow at the Department of Information, Operations, and Management Sciences at Leonard N. Stern School of Business of New York University. He received his Ph.D. degree in Computer Science from Columbia University in 2004.

He has received nine “Best Paper” awards and nominations and is the recipient of the 2015 Lagrange Prize, for his contributions in the field of social media, user-generated content, and crowdsourcing.

12 July 2018, 10:30 am, DC 1302

Title: Managing Big Multidimensional Data – A Journey from Data Acquisition to Prescriptive Analytics
Speaker: Torben Bach Pedersen, Aalborg University
Abstract:

Data collected from new sources such as sensors and smart devices is large, fast, and often complex. There is a universal wish to perform multidimensional OLAP-style analytics on such data, i.e., to turn it into “Big Multidimensional Data”. Supporting this is a multi-stage journey, requiring new tools and systems, and forming a new, extended data cycle with models as a key concept.

We will look at three specific steps in this data cycle. First, we will look at model-based data acquisition and cleansing for indoor positioning data. Then we will move on to model-based distributed storage and query processing for large and fast time series in the ModelarDB system. Finally, we will present SolveDB, an SQL-based tool supporting a new type of analytics, prescriptive analytics, which integrates descriptive and predictive analytics with optimization problem solving to prescribe optimal actions. Application domains such as Smart Logistics and Smart Energy are used for illustration.
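To give a flavour of model-based storage, the Java sketch below (a toy rendering of the principle, not ModelarDB's actual algorithm, which selects among several model types) replaces runs of time-series points with constant models that stay within a user-chosen error bound:

import java.util.ArrayList;
import java.util.List;

public class ModelCompress {
    record Segment(int start, int length, double value) {}

    // Greedy compression: grow a segment while one constant value
    // (the midpoint of the running min/max) stays within eps of
    // every point; otherwise emit the segment and start a new one.
    static List<Segment> compress(double[] series, double eps) {
        List<Segment> out = new ArrayList<>();
        int start = 0;
        double lo = series[0], hi = series[0];
        for (int i = 1; i < series.length; i++) {
            double nlo = Math.min(lo, series[i]), nhi = Math.max(hi, series[i]);
            if (nhi - nlo <= 2 * eps) { lo = nlo; hi = nhi; }
            else {
                out.add(new Segment(start, i - start, (lo + hi) / 2));
                start = i; lo = hi = series[i];
            }
        }
        out.add(new Segment(start, series.length - start, (lo + hi) / 2));
        return out;
    }

    public static void main(String[] args) {
        // A sensor reading that is nearly constant, then jumps:
        double[] series = {20.1, 20.2, 20.0, 20.1, 25.3, 25.2, 25.4};
        // Two segments suffice at eps = 0.5 instead of seven raw points.
        System.out.println(compress(series, 0.5));
    }
}

Queries can then be answered from the models directly, which is what makes the approach attractive for large and fast sensor time series.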

Bio:

Torben Bach Pedersen is a Professor of Computer Science at Aalborg University, Denmark. His research interests include many aspects of Big Data analytics, with a focus on technologies for "Big Multidimensional Data" — the integration and analysis of large amounts of complex and highly dynamic multidimensional data in domains such as smart energy (energy data management), logistics and transport (moving objects and GPS data), and Linked Open Data.

He is an ACM Distinguished Scientist, and a member of the Danish Academy of Technical Sciences, the SSTD Endowment, and the SSDBM Steering Committee. He has served as Area Editor for IEEE Transactions on Big Data, Information Systems and Springer EDBS, PC Chair for DaWaK, DOLAP, SSDBM, and DASFAA, and regularly serves on the PCs of the major database conferences like SIGMOD, PVLDB, ICDE and EDBT. He received Best Paper/Demo awards from ACM e-Energy and WWW.

30 July 2018, 2:00 pm, DC 1304

Title: Data Models from Traditional Databases to NoSQL Systems
Speaker: Paolo Atzeni, Università Roma Tre
Abstract:

NoSQL systems have gained popularity for many reasons, including the flexibility they provide in modeling, which relaxes the rigidity imposed by the relational model and by other structured models.

The talk will discuss how traditional notions related to modeling can be useful in this context as well, both in the search for standardization and uniform access (as the variety of systems and models can create difficulties for developers and their organizations) and in supporting generic approaches to logical and physical design (with the idea that some principles apply to most systems, despite the significant heterogeneity).

Bio:

Paolo Atzeni is Professor of Databases and Head of the Department of Engineering at Università Roma Tre. He received his Dr. Ing. degree in Electrical Engineering from Università di Roma "La Sapienza" in 1980. Before joining Università Roma Tre, he was with IASI-CNR in Rome, then a faculty member at Università di Napoli, and later a professor at Università di Roma La Sapienza. He has also held visiting appointments at the University of Toronto, Università dell'Aquila, and Microsoft Research.

He has worked on various topics in the database field, including relational database theory, conceptual models and design tools, deductive databases, databases and the Web, model management, and the cooperation of database systems. He founded the database group at Roma Tre, which includes five faculty members and various postdocs and students; the group collaborates with groups in Italy and abroad on topics that include data models, data warehouses, and data in the Web world.

He was a trustee and the vice-president of the VLDB Endowment, and he is currently a member of the Executive Board of the EDBT Association, of which he is also past President.