The Data Systems Seminar Series provides a forum for presentation and discussion of interesting and current data systems issues. It complements our internal data systems meetings by bringing in external colleagues. The talks that are scheduled for this year are listed below.
The talks are usually held on a Monday at 10:30 am in room DC 1302. Exceptions are flagged. Due to COVID-19, all talks will be held virtually over Zoom until further notice.
We will try to post the presentation notes whenever possible. Please click on the presentation title to access these notes.
The Database Seminar Series is supported by
Danqi Chen |
Jennie Rogers |
Peter Boncz |
Aaron Elmore |
Scott Meyer |
Tianzheng Wang |
Pınar Tözün |
Xiaokui Xiao |
14 September 2020; 10:30AM
Title: | Recent Advances in Open-domain Question Answering |
Speaker: | Danqi Chen, Princeton University |
Abstract: | Open-domain question answering, the task of automatically answering questions posed by humans in a natural language, usually based on a large collection of unstructured documents, has (re-)gained a lot of popularity in the last couple of years. In this talk, I will discuss many recent exciting developments which have greatly advanced the field, including several works of ours. In particular, I would like to discuss the importance of pre-training for question answering, learning dense representations for retrieval in place of sparse models, the role of structured knowledge, as well as the trade-off between open-book and closed-book models. I will conclude with current limitations and future directions. |
Bio: | Danqi Chen is an Assistant Professor of Computer Science at Princeton University and co-leads the Princeton NLP Group. Her research focuses on deep learning for natural language processing, especially in the intersection of text understanding and knowledge representation & reasoning and applications in question answering, information extraction, and conversational systems. Before joining Princeton, Danqi worked as a visiting scientist at Facebook AI Research in Seattle. She received her Ph.D. from Stanford University (2018) and B.E. from Tsinghua University (2012), both in Computer Science. In the past, she was a recipient of the 2019 Arthur Samuel Best Doctoral Thesis Award at Stanford University, a Facebook Fellowship, a Microsoft Research Women’s Fellowship, and paper awards at ACL’16 and EMNLP’17. |
19 October 2020; 10:30AM
Title: | Privacy-preserving querying for data federations |
Speaker: | Jennie Rogers, Northwestern University |
Abstract: | We live in a golden age of data abundance. In numerous domains - including healthcare, education research, sociology, and finance - it is standard practice for data owners to keep their records in private silos to which only a few trusted users have access. As a result, data about individuals or entities are routinely fractured over two or more silos. Researchers and analysts in these domains wish to learn aggregates over fractured data, but cannot do so owing to privacy concerns or regulatory requirements. A private data federation is a data sharing platform with which an analyst queries the union of the records of multiple silos using cryptographic protocols such that no information is revealed except that which can be deduced from its query answers. These answers are optionally noised with differential privacy to withhold information about individuals in the dataset. The data owners evaluate a private data federation query amongst themselves using secure multi-party computation. This security comes at a high performance cost, and evaluating queries naïvely with this approach is orders of magnitude slower than running the same workload insecurely. To offer efficient query evaluation and provable privacy guarantees, my team and I generalized principles of query optimization to this setting. We also created novel techniques to bring approximate query processing to this platform to both speed up querying and to contribute noise to its privacy-preserving query answers. I will close with a discussion of our pilot study deploying this technology in Chicago-area hospitals for clinical research. |
Bio: | Jennie Rogers is an assistant professor of Computer Science at Northwestern University. Her research is motivated by empowering people with data. More specifically, she investigates pragmatic privacy-preserving data analytics, federating databases over multiple data models, and new approaches with which individuals can explore and understand their data. She received the NSF CAREER Award in 2019 and the Northwestern Computer Science Faculty Service Award in 2020. |
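The optional noising step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the standard Laplace mechanism applied to a counting query with sensitivity 1; the abstract does not say which mechanism the speaker's system uses, and the function names `laplace_noise` and `noisy_count` are hypothetical.

```python
import random


def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, sensitivity=1.0, rng=random):
    """Return the count plus Laplace noise with scale sensitivity / epsilon.

    Assumption: adding or removing one individual changes the count by at
    most `sensitivity`, which is the usual setting for a COUNT query.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# A seeded generator makes the sketch repeatable; the value is illustrative.
rng = random.Random(7)
answer = noisy_count(128, epsilon=0.5, rng=rng)
```

A smaller `epsilon` gives a larger noise scale and therefore stronger privacy at the cost of accuracy.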
16 November 2020; 10:30AM
Title: | FSST: fast random-access string compression |
Speaker: | Peter Boncz, CWI |
Abstract: | Strings are prevalent in real-world data sets. They often occupy a large fraction of the data and are slow to process. In this work, we present Fast Static Symbol Table (FSST), a lightweight compression scheme for strings. On text data, FSST offers decompression and compression speed similar to or better than the best speed-optimized compression methods, such as LZ4, yet offers significantly better compression factors. Moreover, its use of a static symbol table allows random access to individual, compressed strings, enabling lazy decompression and query processing on compressed data. We believe these features will make FSST a valuable piece in the standard compression toolbox. |
Bio: | Peter Boncz holds appointments as tenured researcher at CWI and professor at VU University Amsterdam. His academic background is in core database architecture, with MonetDB the systems outcome of his PhD -- MonetDB much later won the 2016 ACM SIGMOD Systems Award. He has a track record in bridging the gap between academia and commercial application, receiving the Dutch ICT Regie Award 2006 for his role in the CWI spin-off company Data Distilleries. In 2008 he co-founded Vectorwise around the analytical database system by the same name which pioneered vectorized query execution -- later acquired by Actian. He is co-recipient of the 2009 VLDB 10 Years Best Paper Award, and in 2013 received the Humboldt Research Award for his research on database architecture. He also works on graph data management, founding in 2013 the Linked Data Benchmark Council (LDBC), a benchmarking organization for graph database systems. |
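The idea in the abstract above, that a static symbol table lets each string be decompressed on its own, can be sketched as follows. This is a simplified illustration and not the FSST algorithm itself: the table is built with a naive frequency heuristic, and the one-byte codes, the escape code 255, and the 8-byte symbol limit are assumptions made for this sketch. All function names are hypothetical.

```python
from collections import Counter

ESCAPE = 255         # assumed code meaning "the next byte is a literal"
MAX_SYMBOLS = 255    # codes 0..254 refer to entries of the symbol table
MAX_SYMBOL_LEN = 8   # assumed upper bound on symbol length in bytes


def build_symbol_table(samples):
    """Pick frequent substrings as symbols (naive heuristic, not FSST's own)."""
    counts = Counter()
    for s in samples:
        for length in range(2, MAX_SYMBOL_LEN + 1):
            for i in range(len(s) - length + 1):
                counts[s[i:i + length]] += 1
    # Rank by approximate bytes saved: each use replaces len(sym) bytes by one.
    ranked = sorted(counts, key=lambda sym: counts[sym] * (len(sym) - 1),
                    reverse=True)
    return ranked[:MAX_SYMBOLS]


def compress(data, table):
    """Encode one string with greedy longest-match against the table."""
    codes = {sym: code for code, sym in enumerate(table)}
    out = bytearray()
    i = 0
    while i < len(data):
        for length in range(min(MAX_SYMBOL_LEN, len(data) - i), 0, -1):
            code = codes.get(data[i:i + length])
            if code is not None:
                out.append(code)
                i += length
                break
        else:
            # No symbol matches here: emit the byte as an escaped literal.
            out += bytes([ESCAPE, data[i]])
            i += 1
    return bytes(out)


def decompress(blob, table):
    """Decode one string using only the shared, static table."""
    out = bytearray()
    i = 0
    while i < len(blob):
        if blob[i] == ESCAPE:
            out.append(blob[i + 1])
            i += 2
        else:
            out += table[blob[i]]
            i += 1
    return bytes(out)


strings = [b"http://example.com/a", b"http://example.com/b",
           b"http://example.org/c"]
table = build_symbol_table(strings)
compressed = [compress(s, table) for s in strings]
# Random access: the second string is decoded without touching the others.
second = decompress(compressed[1], table)
```

Because the table never changes after it is built, `decompress` needs no state from neighbouring strings, which is what makes lazy decompression of individual values possible.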
14 December 2020; 10:30AM
Title: | |
Speaker: | Aaron Elmore, University of Chicago |
Abstract: | The coming end of Moore’s law requires that data systems be more judicious with computation and resources as the growth in data outpaces the availability of computational resources. Current database systems are eager and aggressively consume resources to immediately and quickly complete the task at hand. Intelligently deferring a task to a later point in time can increase result reuse, reduce work that might later be invalidated, or avoid unnecessary work altogether. In this talk I will introduce CrocodileDB, a resource-efficient database system that automatically optimizes deferment based on user-specification and workload prediction. CrocodileDB integrates new ways of specifying timing information, new query execution policies, new task schedulers, and new data loading schemes. In particular, this talk will highlight two new query execution paradigms, Intermittent Query Processing and Incremental-Aware Query Execution. |
Bio: | Aaron J. Elmore is an Assistant Professor in the Department of Computer Science, and the College of the University of Chicago. Aaron was previously a Postdoctoral Associate at MIT. Aaron's thesis on Elasticity Primitives for Database-as-a-Service was completed at the University of California, Santa Barbara. His recent research interests focus on building data systems that address the growing data deluge. He is currently an associate editor for SIGMOD Record, and has served as co-chair for the SIGMOD demonstration track and the inaugural SIGMOD student research competition, and as VLDB proceedings editor. |
11 January 2021; 3:00PM (Please note the unusual time)
Title: | LIquid: Soul of a New Graph Database |
Speaker: | Scott Meyer, LinkedIn |
Abstract: | The talk will present LIquid, the graph database which serves what we call the “on-line economic graph.” LIquid is in full production, serving 1M QPS against a ~10Tb corpus. The first half of the talk defines what a graph database is, ultimately a relational system which satisfies certain constraints. The second half explains how one might go about building such a graph database. |
Bio: | Scott has had a long and varied career in software development, including graphics, networking, GUI applications, OODBs, a JVM implementation, 3 years spent sailing around the Pacific, and most recently, 15 years working on graph databases at Metaweb, Google, and LinkedIn. |
19 April 2021; 1:00PM (Note the different time)
Title: | Hiding data stalls in main-memory database engines using coroutines |
Speaker: | Tianzheng Wang, Simon Fraser University |
Abstract: | As the speed gap between memory and CPU continues to widen, memory accesses are becoming a major overhead in pointer-rich data structures, such as B-trees, hash tables and linked lists, which are widely used in modern database systems. Software prefetching can be effective in hiding such stalls by carefully scheduling and batching loads of the needed memory blocks in advance, but it requires various changes to the code base with a vastly different multi-key interface, and prior solutions were mostly piecewise. In this talk we will highlight our experience of tackling these challenges using recently standardized coroutines in C++20 in a full database engine. The crux is a new "coroutine-to-transaction" paradigm that simplifies application development with backward compatibility, while retaining the performance benefits of software prefetching. |
Bio: | Tianzheng Wang is an assistant professor in the School of Computing Science at Simon Fraser University (SFU) in Vancouver, Canada. He works on the boundary between software and modern hardware (in particular persistent memory, manycore processors and next-generation networks). His current research focuses on database systems and related areas that impact the design of data-intensive systems, such as operating systems, distributed systems and synchronization. Tianzheng Wang received his Ph.D. and M.Sc. degrees in Computer Science from the University of Toronto in 2017 and 2014, respectively (advised by Ryan Johnson and Angela Demke Brown). He received his B.Sc. degree in Computing (First Class Honours) from Hong Kong Polytechnic University in 2012. Prior to joining SFU, he spent one year (2017-2018) at Huawei Canada Research Centre (Toronto) as a research engineer. His work has been recognized by a 2021 ACM SIGMOD Research Highlight Award, a 2019 IEEE TCSC Award for Excellence in Scalable Computing (Early Career Researchers) and nominations for best/memorable paper awards. |
17 May 2021; 10:30AM
Title: | Data-Intensive Systems in the Microsecond Era |
Speaker: | Pınar Tözün, IT University of Copenhagen |
Abstract: | The late 2000s and early 2010s saw the rise of data-intensive systems optimized for in-memory execution. Today, it has become increasingly clear that just optimizing for main memory is neither economically viable nor strictly necessary for high performance. Modern SSDs, such as Z-NAND and Optane, can access data at a latency of around 10 microseconds. As a result, there has been a wide range of work on designing data structures that can take advantage of both main memory and modern SSDs. In parallel, there are efforts to transform the software stack for I/O for better efficiency. In this talk, I will go over the landscape of modern SSDs and the storage stack, focusing on the challenges and opportunities that await data-intensive systems when it comes to exploiting them. |
Bio: | Pınar Tözün is an Associate Professor at IT University of Copenhagen. Before ITU, she was a research staff member at IBM Almaden Research Center. Prior to joining IBM, she received her PhD from EPFL. Her thesis received ACM SIGMOD Jim Gray Doctoral Dissertation Award Honorable Mention in 2016. Her research focuses on performance characterization of data-intensive workloads, scalability and efficiency of data-intensive systems on modern processors and storage, and resource-aware machine learning. |
14 June 2021; 9:30AM (Note the different time)
Title: | Efficient Network Embeddings for Large Graphs |
Speaker: | Xiaokui Xiao, National University of Singapore |
Abstract: | Given a graph G, network embedding maps each node in G into a compact, fixed-dimensional feature vector, which can be used in downstream machine learning tasks. Most of the existing methods for network embedding fail to scale to large graphs with millions of nodes, as they either incur significant computation cost or generate low-quality embeddings on such graphs. In this talk, we will present two efficient network embedding algorithms for large graphs with and without node attributes, respectively. The basic idea is to first model the affinity between nodes (or between nodes and attributes) based on random walks, and then factorize the affinity matrix to derive the embeddings. The main challenges that we address include (i) the choice of the affinity measure and (ii) the reduction of space and time overheads entailed by the construction and factorization of the affinity matrix. Extensive experiments on large graphs demonstrate that our algorithms outperform the existing methods in terms of both embedding quality and efficiency. |
Bio: | Xiaokui Xiao is a Dean's Chair Associate Professor at the School of Computing, National University of Singapore (NUS). His research focuses on data management, with special interests in data privacy and algorithms for large data. He received a Ph.D. in Computer Science from the Chinese University of Hong Kong in 2008. Before joining NUS in 2018, he was an associate professor at the Nanyang Technological University, Singapore. |