The Data Systems Seminar Series provides a forum for presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks that are scheduled for this year are listed below.
The talks are usually held on a Monday at 10:30 am in room DC 1302. Exceptions are flagged.
We will try to post presentation notes whenever possible. Click on a presentation title to access its notes.
The Database Seminar Series is supported by
|Title:||What if we could reason about the design space of data structures?|
|Speaker:||Stratos Idreos, Harvard University|
|Abstract:||Data structures are critical in any data-driven scenario, and they define the behavior of modern data systems. However, they are notoriously hard to design due to a massive design space and the dependence of performance on workload and hardware, which evolve continuously. In this talk, we ask two questions: What if we knew how many and which data structures are possible to design? What if we could compute the expected performance of a data structure design on a given workload and hardware without having to implement it and without even having access to the target machine? We will discuss our quest for 1) the first principles of data structures, 2) design continuums that make it possible to automate design, and 3) self-designing systems that can morph between what we now consider fundamentally different structures. We will draw examples from the NoSQL key-value store design space and discuss how to accelerate them and balance space-time tradeoffs.
|Bio:||Stratos Idreos is an assistant professor of Computer Science at Harvard University where he leads DASlab, the Data Systems Laboratory@Harvard SEAS. Stratos works on data system architectures with emphasis on how we can make it easy to design efficient data systems as applications and hardware keep evolving and on how we can make it easy to use these systems even for non-experts. For his doctoral work on Database Cracking, Stratos won the 2011 ACM SIGMOD Jim Gray Doctoral Dissertation award and the 2011 ERCIM Cor Baayen award. He is also a recipient of an IBM zEnterprise System Recognition Award, a VLDB Challenges and Visions best paper award and an NSF Career award. In 2015 he was awarded the IEEE TCDE Rising Star Award from the IEEE Technical Committee on Data Engineering for his work on adaptive data systems.
|Title:||ExpoDB: Towards a Unified OLTP and OLAP Over a Secure Platform|
|Speaker:||Mohammad Sadoghi, UC Davis|
|Abstract:||Arguably, data is a new natural resource in the enterprise world with an unprecedented degree of proliferation and heterogeneity. However, to derive real-time actionable insights from the data, it is important to bridge the gap between analyzing a large volume of data (i.e., OLAP) and managing the data that is being updated at a high velocity (i.e., OLTP). Historically, there has been a divide where specialized engines were developed to support either OLAP or OLTP workloads but not both, thus limiting the analysis to stale and possibly irrelevant data.
In this talk, we present our proposed architecture to combine the real-time processing of analytical and transactional workloads within a single unified engine. To support querying and retaining the current and historic data, we design novel, efficient index maintenance techniques, paving the way to a novel optimistic concurrency control. From the concurrency perspective, we further pose a question: is it possible to have concurrent execution over shared data without having any concurrency control? To answer this question, we investigate a deterministic approach to transaction processing geared towards many-core hardware by proposing a novel queue-oriented, control-free concurrency architecture (QueCC) that exhibits minimal coordination during execution while offering serializable guarantees. From the storage perspective, we develop an update-friendly lineage-based storage architecture (LSA) that offers a contention-free and lazy staging of columnar data from a write-optimized form (OLTP) into a read-optimized form (OLAP) in a transactionally consistent approach. Finally, we share our vision to move from a centralized platform onto a secure, democratic, and decentralized computational model.
|Bio:||Mohammad Sadoghi is an Assistant Professor of Computer Science at the University of California, Davis. Formerly, he was an Assistant Professor at Purdue University and Research Staff Member at IBM T.J. Watson Research Center. He received his Ph.D. from the Computer Science Department at the University of Toronto in 2013. His research spans all facets of secure and massive-scale data management. At UC Davis, he leads the ExpoLab research group with the aim to pioneer a new exploratory data platform called ExpoDB, a distributed ledger that unifies secure transactional and real-time analytical processing, all centered around a democratic and decentralized computational model. Prof. Sadoghi has over 60 publications and has filed 34 U.S. patents. His SIGMOD'11 paper was awarded the EPTS Innovative Principles Award, his EDBT'11 paper was selected as one of the best EDBT papers in 2011, and his ESWC'16 paper won the Best In-Use Paper Award. He is serving as Workshop/Tutorial Co-Chair at Middleware'18, has served as the PC Chair (Industry Track) at ACM DEBS'17, co-chaired a new workshop series, entitled Active, at both ICDE and Middleware, and co-chaired the Doctoral Symposium at Middleware'17. He served as the Area Editor for Transaction Processing in the Encyclopedia of Big Data Technologies by Springer. He is co-authoring a book on "Transaction Processing on Modern Hardware" as part of Morgan & Claypool Synthesis Lectures on Data Management. He regularly serves on the program committee of SIGMOD, VLDB, ICDE, EDBT, Middleware, ICDCS, DEBS, and ICSOC.
|Speaker:||A. Erdem Sarıyüce, University at Buffalo|
|Abstract:||Finding dense substructures in a network is a fundamental graph mining operation, with applications in bioinformatics, social networks, and visualization, to name a few. Yet most standard formulations of this problem (like clique, quasi-clique, densest at-least-k subgraph) are NP-hard. Furthermore, the goal is rarely to find the “true optimum” but to identify many (if not all) dense substructures, understand their distribution in the graph, and ideally determine relationships among them. In this talk, I will present a framework that we designed to find dense regions of the graph with hierarchical relations. Our model can summarize the graph as a tree of subgraphs. With the right parameters, our framework generalizes two widely accepted dense subgraph models: k-core and k-truss decompositions. We present practical sequential and parallel local algorithms for our framework and empirically evaluate their behavior on a variety of real graphs. Furthermore, we adapt our framework for bipartite graphs, which are used to model group relationships such as author-paper, word-document, and user-product data. We demonstrate how the proposed algorithms can be utilized for the analysis of a citation network among physics papers and a user-product network of Amazon Kindle books.|
|Bio:||A. Erdem Sariyuce is an Assistant Professor in Computer Science and Engineering at the University at Buffalo. Prior to that, he was the John von Neumann postdoctoral fellow at Sandia National Laboratories. Erdem received his Ph.D. in Computer Science from the Ohio State University. He conducts research on large-scale graph mining. In particular, he develops practical algorithms to explore and process real-world networks. He received the Best Paper Runner-up Award at the International World Wide Web Conference (WWW) in 2015. More details can be found at http://sariyuce.com.|
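For readers unfamiliar with the k-core model mentioned in the abstract above, the following is a minimal Python sketch of the classical peeling procedure for computing core numbers. It is the textbook algorithm, not the speaker's framework or its local algorithms; the function name `core_numbers` and the edge-list input format are assumptions made for illustration.

```python
from collections import defaultdict


def core_numbers(edges):
    """Return {vertex: core number} for an undirected simple graph.

    Illustrative helper (hypothetical name): repeatedly removes a vertex of
    minimum remaining degree. A vertex's core number is the largest minimum
    degree seen up to the moment it is removed.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Simple quadratic selection; bucket queues make this linear time.
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core


# A triangle (a, b, c) with one pendant vertex d attached to c.
example = core_numbers([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
```

In this example the triangle vertices form the 2-core and the pendant vertex belongs only to the 1-core, which is the kind of nesting that a hierarchy of dense subgraphs summarizes.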
|Title:||Algorithms and Optimizations for Incremental Window-Based Aggregations|
|Speaker:||Panos K. Chrysanthis, University of Pittsburgh|
|Abstract:||Online analytics, in most advanced scientific, business, and defense applications, rely heavily on the efficient execution of large numbers of Aggregate Continuous Queries (ACQs). ACQs continuously aggregate streaming data and periodically produce results such as max or average over a given window of the latest data. It has been shown that in processing ACQs it is beneficial to use incremental evaluation, which involves storing and reusing calculations performed over the unchanged parts of the window, rather than re-evaluating the entire window after each update. In this talk, we examine how the principle of sharing is applied in the partial and final aggregation techniques and present our SlickDeque and WeaveShare techniques that optimize the execution of multi-ACQs on single and multiple computing nodes.
|Bio:||Panos K. Chrysanthis is a Professor of Computer Science and the founding director of the Advanced Data Management Technologies Laboratory in the School of Computing and Information at the University of Pittsburgh. He is also an Adjunct Professor at Carnegie Mellon University and the University of Cyprus. His research interests lie at the intersection of data management, distributed systems and collaborative applications. He is a recipient of the NSF CAREER Award and he is an ACM Distinguished Scientist and a Senior Member of IEEE. He is also a recipient of the University of Pittsburgh Provost Award for Excellence in Mentoring (doctoral students). He is currently the Special Issues Coordinator for the Distributed and Parallel Databases Journal and a Program Committee Co-chair of IEEE ICDE 2018. He earned his BS degree from the University of Athens, Greece and his MS and PhD degrees from the University of Massachusetts at Amherst.|
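The idea of incremental evaluation described in the abstract above can be illustrated with a sliding-window max. The sketch below uses the well-known monotonic-deque technique, which reuses work from the unchanged part of the window instead of rescanning it; it is not the SlickDeque or WeaveShare implementation, and the function name `sliding_max` is an assumption made for illustration.

```python
from collections import deque


def sliding_max(values, window):
    """Return the max of each full window of `window` consecutive values.

    `values` must be an indexable sequence. The deque holds indices whose
    values are strictly decreasing, so the front is always the current max.
    """
    if window <= 0:
        raise ValueError("window must be positive")

    candidates = deque()
    results = []
    for i, x in enumerate(values):
        # Older entries that are not larger than x can never be the max again.
        while candidates and values[candidates[-1]] <= x:
            candidates.pop()
        candidates.append(i)
        # Drop the front index once it slides out of the window.
        if candidates[0] <= i - window:
            candidates.popleft()
        if i >= window - 1:
            results.append(values[candidates[0]])
    return results


maxima = sliding_max([1, 3, -1, -3, 5, 3, 6, 7], 3)
```

Each element is pushed and popped at most once, so the amortized cost per update is constant, compared with a cost proportional to the window size when the whole window is re-evaluated.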
|Speaker:||Verena Kantere, University of Ottawa|
|Abstract:||Big Data analytics in science and industry are performed on a range of heterogeneous data stores, both traditional and modern, and on a diversity of query engines. Workflows are difficult to design and implement since they span a variety of systems. To reduce development time and processing costs, some automation is needed. In this talk we will present a new platform to manage analytics workflows. The platform enables workflow design, execution, analysis and optimization with respect to time efficiency, over multiple execution engines. Such configurations are emerging as a common paradigm used to combine analysis of unstructured data with analysis of structured data (e.g., NoSQL plus SQL). We focus on the usability of the platform by users with various expertise, the automation of the analysis and optimization of execution, as well as the effect of optimization on workflow execution. The platform also performs multi-workflow optimization and workflow recalibration. The talk will finish with some plans for future research on data management optimization on hybrid infrastructures, i.e., infrastructures that comprise multiple sites and heterogeneous parts and combine private clusters and public resources.
|Bio:||Verena Kantere is an Associate Professor at the School of Electrical Engineering and Computer Science (EECS) at the University of Ottawa (UOttawa). Previously, she was an Assistant Professor at the School of Electrical and Computer Engineering (ECE) of the National Technical University of Athens (NTUA) and a Maître d’Enseignement et de Recherche at the Centre Universitaire d’Informatique (CUI) of the University of Geneva (UniGe). She has been working towards the provision of data services in large-scale systems, like cloud systems, focusing on the management of Big Data and the performance of Big Data analytics, by developing methods, algorithms and fully fledged systems. Before coming to UniGe she was a tenure-track junior assistant professor at the Department of Electrical Engineering and Information Technology at the Cyprus University of Technology (CUT). She received a Diploma and a Ph.D. from the National Technical University of Athens (NTUA) and an M.Sc. from the Department of Computer Science at the University of Toronto (UofT), where she also started her PhD studies. After the completion of her PhD studies she worked as a postdoctoral researcher at the École Polytechnique Fédérale de Lausanne (EPFL). During her graduate studies she developed methods, algorithms and fully fledged systems for data exchange and coordination in Peer-to-Peer (P2P) overlays with structured and unstructured data, focusing on the solution of problems of data heterogeneity, query processing and rewriting, multi-dimensionality and management of continuous queries. Furthermore, she has worked in the field of the Semantic Web on the problems of semantic similarity, annotation, clustering and integration.|
|Title:||Meta-Analysis for Retrieval Experiments Involving Multiple Test Collections|
|Speaker:||Ian Soboroff, National Institute of Standards and Technology|
|Abstract:||Traditional practice recommends that information retrieval experiments be run over multiple test collections, to support, if not prove, that gains in performance are likely to generalize to other collections or tasks. However, because of the pooling assumptions, evaluation scores are not directly comparable across different test collections. We present a widely-used statistical tool, meta-analysis, as a framework for reporting results from IR experiments using multiple test collections. We demonstrate the meta-analytical approach through two standard experiments on stemming and pseudo-relevance feedback, and compare the results to those obtained from score standardization. Meta-analysis incorporates several recent recommendations in the literature, including score standardization, reporting effect sizes rather than score differences, and avoiding a reliance on null-hypothesis statistical testing, in a unified approach. It therefore represents an important methodological improvement over using these techniques in isolation.|
|Bio:||Dr. Ian Soboroff is a computer scientist and leader of the Retrieval Group at the National Institute of Standards and Technology (NIST). The Retrieval Group organizes the Text REtrieval Conference (TREC), the Text Analysis Conference (TAC), and the TREC Video Retrieval Evaluation (TRECVID). These are all large, community-based research workshops that drive the state of the art in information retrieval, video search, web search, information extraction, text summarization and other areas of information access. He has co-authored many publications in information retrieval evaluation, test collection building, text filtering, collaborative filtering, and intelligent software agents. His current research interests include building test collections for social media environments and nontraditional retrieval tasks.
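As a rough illustration of how meta-analysis combines results across test collections, the sketch below computes a fixed-effect, inverse-variance weighted pooled effect size. This is one standard meta-analytic model and is not necessarily the one used in the talk; the function name `fixed_effect` and the input numbers are hypothetical and chosen only to show the arithmetic.

```python
import math


def fixed_effect(effects, variances):
    """Pool per-collection effect sizes with inverse-variance weights.

    Returns (pooled effect, standard error). Collections whose effect is
    estimated more precisely (smaller variance) receive more weight.
    """
    if len(effects) != len(variances) or not effects:
        raise ValueError("need one variance per effect, and at least one")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")

    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total
    std_err = math.sqrt(1.0 / total)
    return pooled, std_err


# Hypothetical effect sizes from two test collections (illustrative only).
pooled, std_err = fixed_effect([0.2, 0.4], [0.01, 0.01])
```

With equal variances the pooled effect is simply the mean of the two effects, and the standard error shrinks relative to either collection alone.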
|Speaker:||Dan Suciu, University of Washington|
|Abstract:||We discuss two novel connections between information theory and data management. The first is a new paradigm for query processing, which we call "from proofs to algorithms". In order to evaluate a query, one first proves an upper bound on its output, by proving an information theoretic inequality. Then, each step of this proof becomes a relational operator. The resulting algorithm is "worst case optimal", meaning that it runs in time bounded by the maximum output to the query. Second, we consider the "implication problem" for Functional Dependencies, Multivalued Dependencies, and Conditional Independence constraints in graphical models, and ask whether it relaxes to an inequality between the corresponding information theoretic measures. When it does, then we can convert an implication between exact constraints into an implication between approximate constraints. We show that FDs, MVDs, and the special cases of graphoid axioms studied by Geiger and Pearl do relax, but, rather surprisingly, there exist exact implications that cannot be relaxed.
Joint work with Mahmoud Abo Khamis, Batya Kenig, and Hung Q. Ngo.
|Bio:||Dan Suciu is a Professor in Computer Science at the University of Washington. He received his Ph.D. from the University of Pennsylvania in 1995, was a principal member of the technical staff at AT&T Labs and joined the University of Washington in 2000. Suciu is conducting research in data management, with an emphasis on topics related to Big Data and data sharing, such as probabilistic data, data pricing, parallel data processing, and data security. He is a co-author of two books: Data on the Web: from Relations to Semistructured Data and XML, 1999, and Probabilistic Databases, 2011. He is a Fellow of the ACM, holds twelve US patents, received the best paper award in SIGMOD 2000 and ICDT 2013, the ACM PODS Alberto Mendelzon Test of Time Award in 2010 and in 2012, the 10 Year Most Influential Paper Award in ICDE 2013, the VLDB Ten Year Best Paper Award in 2014, and is a recipient of the NSF Career Award and of an Alfred P. Sloan Fellowship. Suciu serves on the VLDB Board of Trustees, and is an associate editor for the Journal of the ACM, VLDB Journal, ACM TWEB, and Information Systems and is a past associate editor for ACM TODS and ACM TOIS. Suciu's PhD students Gerome Miklau, Christopher Re and Paris Koutris received the ACM SIGMOD Best Dissertation Award in 2006, 2010, and 2016 respectively, and Nilesh Dalvi was a runner-up in 2008.
|Title:||Just-in-Time Indexing: Rethinking Data Layout as a Compiler Problem|
|Speaker:||Oliver Kennedy, University at Buffalo|
|Abstract:||Indexing is a game of tradeoffs: Organize your data now and be rewarded with lower read latencies later. The question of whether, how, or when to organize has led to a proliferation of many different, often highly-specialized index structures. In this talk, I present an alternative that replaces the holistic tradeoff decisions made by classical index structures with smaller, localized organizational transformations. This approach, which we call Just-in-Time Data Structures (JITDs), is based on a grammar for data structure instances. A sentence in this grammar corresponds to a specific physical layout, and many classical data structures can be described as syntactic restrictions on this grammar. Mirroring a just-in-time compiler, a JITD incrementally replaces phrases in a sentence (i.e., physical sub-structures) with different, ideally more efficient ones. This replacement happens in the background, without blocking reader threads for longer than an atomic pointer swap. JITDs are competitive with existing, commonly-used data structures in common-case scenarios, while exhibiting better performance in dynamic settings. I will present our foundational work on JITDs, and outline some of our more recent work on automatic rewrite policy optimization.
(This work is supported by NSF Awards IIS-1617586 and CNS-1629791)
|Bio:||Oliver Kennedy is an assistant professor at the University at Buffalo. He earned his PhD from Cornell University in 2011 and now leads the Online Data Interactions (ODIn) lab, which operates at the intersection of databases and programming languages. Oliver is the recipient of an NSF CAREER award, UB's Exceptional Scholar Award, and the UB SEAS Early Career Teacher of the Year Award. Several of his papers have been invited to "Best of" compilations from SIGMOD and VLDB. The ODIn lab is currently exploring uncertain data management, just-in-time data structure design, and "small data" management.
|Title:||Ultra-scalable transactional management|
|Speaker:||Ricardo Jimenez-Peris, LeanXcale|
|Abstract:||The talk will present an ultra-scalable distributed algorithm for transactional management and how it has been implemented as part of the LeanXcale database. The talk will go into the details of how the ACID properties have been scaled out independently in a composable manner.
The talk will also cover the architectural aspects of the system and how it has been integrated with the rest of the components of the LeanXcale database, the distributed storage engine and the distributed SQL query engine. It will also present the underpinnings of how to blend operational and analytical workloads through innovations in the storage engine combined with a distributed OLAP query engine.
|Bio:||Dr. Ricardo Jimenez-Peris is a former professor and researcher on scalable databases and distributed systems, and currently CEO and founder of LeanXcale, a startup commercializing the LeanXcale database. LeanXcale was awarded the "Best SME" prize in 2017 by the European Commission, recognizing the most innovative European startup of the year. He is co-inventor of two patents and co-author of the book "Database Replication" and 100+ research papers and articles. He has been an invited speaker at the headquarters of top tech companies, such as Facebook, Twitter, Salesforce, Heroku, Greenplum, Microsoft, IBM, and HP, to present LeanXcale technology.|
|Title:||Scaling Database Systems to High-Performance Computers|
|Speaker:||Spyros Blanas, The Ohio State University|
|Abstract:||We are witnessing the increasing use of warehouse-scale computers to analyze massive datasets quickly. This poses two challenges for database systems. The first challenge is interoperability with established analytics libraries and tools. Massive datasets often consist of small images (arrays) in file formats like HDF5 and FITS. We will first present ArrayBridge, an open-source I/O library that allows processing with SQL, SciDB or TensorFlow without converting between file formats. ArrayBridge can transparently optimize data placement to make I/O more than 300X faster than directly reading small files. The second challenge is scalability, as warehouse-scale computers expose communication bottlenecks in foundational data processing operations. We will present a data shuffling algorithm that carefully uses RDMA to transmit data up to 4X faster than MPI. We will then present GRASP, an aggregation algorithm for high-cardinality parallel aggregation. By carefully scheduling data transfers to leverage similarity, GRASP avoids unscalable all-to-all communication and completes the aggregation more than 3X faster than repartitioning. We conclude by highlighting additional challenges that need to be overcome to scale database systems to massive computers.
|Bio:||Spyros Blanas is an assistant professor in the Department of Computer Science and Engineering at The Ohio State University. His research interest is high performance database systems, and his current goal is to build a database system for high-end computing facilities. He has received the IEEE TCDE Rising Star award and a Google Research Faculty award. He completed his Ph.D. at the University of Wisconsin–Madison where part of his Ph.D. dissertation was commercialized in Microsoft SQL Server as the Hekaton in-memory transaction processing engine.