Data Systems Seminar Series (2023-2024)

The Data Systems Seminar Series provides a forum for presentation and discussion of interesting and current data systems issues. It complements our internal data systems meetings by bringing in external colleagues. The talks that are scheduled for this year are listed below.

The talks are usually held on a Monday at 10:30AM in room DC 1304. Exceptions are flagged. Due to Covid-19, some talks might be held virtually over Zoom; these will be identified.

The talks are open to the public.

We will post the presentation videos whenever possible. Past DSG Seminar videos can be found on the DSG YouTube channel.


The Data Systems Seminar Series is supported by


   

Reynold Cheng
Eric Lo
Patrick Valduriez
Axel Ngonga
AnHai Doan
Ihab Ilyas
Faisal Nawab
Tilmann Rabl
Xiangyao Yu
Gustavo Alonso
Wolfgang Lehner and Alexander Krause
Philip Bernstein
Darshana Balakrishnan
Matthias Weidlich
Tianzheng Wang

22 August 2023; 10:30AM

Title: HINCare: Data-Driven Volunteering (notes)
Speaker: Reynold C.K. Cheng, University of Hong Kong
Abstract: In Hong Kong, the number of elderly citizens is estimated to rise to one third of the population, or 2.37 million, by the year 2037. As they age and become more frail, the demand for formal support services (e.g., domestic or escort services) will increase significantly in the coming years. However, there is a severe lack of manpower to meet these needs. Some elderly-care homes reported a 70% shortage of employees. There is thus a strong need for volunteer or part-time helpers to take care of elders.

In this talk, I will introduce HINCare, a software platform that encourages a culture of mutual help and volunteering in the community. HINCare uses a HIN (Heterogeneous Information Network) to recommend helpers to elders or other service recipients. The algorithms that use HINs and AI technologies for matching elders and helpers are based on our recent research results. This is the first time that a HIN has been used to support elderly care.

HINCare is now available for download from the Apple App Store and Google Play Store, and has been serving more than a thousand elders and helpers in NGOs (e.g., SKH and CSFC). The app was originally designed for elderly users, but has now expanded its services to support the Community Investment and Inclusion Fund (CIIF) and 14 NGOs engaged in teenage and family services. The system won the HKICT Award 2021, two Asia Smart App Awards (2021 and 2020), and the HKU Faculty Knowledge Exchange Awards 2021.

Bio: Reynold Cheng is a Professor in the Department of Computer Science at the University of Hong Kong (HKU). His research interests are in data science, big graph analytics and uncertain data management. He was an Assistant Professor in the Department of Computing of the Hong Kong Polytechnic University (HKPU) from 2005 to 2008. He received his BEng (Computer Engineering) in 1998, and MPhil (Computer Science and Information Systems) in 2000 from HKU. He then obtained his MSc and PhD degrees from the Department of Computer Science of Purdue University in 2003 and 2005.

Reynold was listed among the World's Top 2% Scientists by Stanford University in 2022, and received the 2023 AI 2000 Most Influential Scholar Honorable Mention in Database. He received the SIGMOD Research Highlight Award 2020, HKICT Awards 2021, and HKU Knowledge Exchange Award (Engineering) 2021. He was granted an Outstanding Young Researcher Award 2011-12 by HKU. He received the Universitas 21 Fellowship in 2011, and two Performance Awards from HKPU Computing in 2006 and 2007. He is an academic advisor to the College of Professional and Continuing Education of HKPU. He is a member of IEEE, ACM, ACM SIGMOD, and UPE. He was a PC co-chair of IEEE ICDE 2021, and has been serving on the program committees and review panels for leading database conferences and journals like SIGMOD, VLDB, ICDE, KDD, IJCAI, AAAI, and TODS. He is on the editorial board of KAIS, IS and DAPD, and is a former editorial board member of TKDE.

24 August 2023; 11:00AM (Note the later start time)

Title: When Private Blockchain Meets Deterministic Database (notes)
Speaker: Eric Lo, Chinese University of Hong Kong
Abstract: Private blockchain as a replicated transactional system shares many commonalities with a distributed database. However, the intimacy between a private blockchain and a deterministic database has never been studied. In essence, both private blockchains and deterministic databases ensure replica consistency by determinism. While private blockchains have started to pursue deterministic transaction executions recently, deterministic databases have already studied deterministic concurrency control protocols for almost a decade. In this talk, I will present Harmony, a novel deterministic concurrency control protocol for blockchain use. We use Harmony to build a new relational blockchain, namely HarmonyBC, which features low abort rates, hotspot resiliency, and inter-block parallelism, all of which are especially important to disk-oriented blockchains.
Bio: Eric Lo is an Associate Professor of Computer Science and Engineering at the Chinese University of Hong Kong (CUHK). He holds an M.Phil degree from the University of Hong Kong, which he earned in 2005, and a PhD degree from ETH Zurich, which he obtained in 2007. Prior to joining CUHK, he worked at both Google and Microsoft. His recent research is centered around AI systems and geo-distributed databases. He has served on the program committee of all major data engineering conferences and currently holds the position of Associate Editor at PVLDB. His research has been recognized with best paper awards and best paper honorable mentions at conferences such as VLDB'05 and ICDE'12. Recently, his work received the ACM SIGMOD Research Highlight Award 2020.

5 September 2023; 10:30AM

Title: Life Science Workflow Services (LifeSWS): Motivations and Architecture (notes)
Speaker: Patrick Valduriez, Inria, University of Montpellier, CNRS, LIRMM
Abstract: Data driven science requires manipulating large datasets coming from various data sources through complex workflows based on a variety of models and languages. With the increasing number of big data sources and models developed by different groups, it is hard to relate models and data and use them in unanticipated ways for specific data analysis. Current solutions are typically ad-hoc, specialized for particular data, models and workflow systems. In this talk, we focus on data driven life science and propose an open service-based architecture, Life Science Workflow Services (LifeSWS), which provides data analysis workflow services for life sciences. We illustrate our motivations and rationale for the architecture with real use cases from life science.
Bio: Patrick Valduriez is a director of research emeritus at Inria, France, the scientific director of the Inria-Brasil international lab. and the Chief Scientist Officer of the LeanXcale company (that delivers a NewSQL database).

He is currently a member of the Zenith team (between Inria and University of Montpellier at the LIRMM lab.), focusing on data science, in particular data management in large-scale distributed and parallel systems and scientific data management. He has authored and co-authored more than 400 technical papers and several textbooks, among which “Principles of Distributed Database Systems” (with Professor Tamer Özsu, University of Waterloo). He currently serves as associate editor of the Distributed and Parallel Databases journal. He has served as PC chair of major conferences such as SIGMOD and VLDB. He was the general chair of SIGMOD 2004, EDBT 2008 and VLDB 2009.

He has received several best paper awards, including at VLDB 2000. He was the recipient of the 1993 IBM scientific prize in Computer Science in France and the 2014 Innovation Award from Inria and the French Academy of Science. He has been an ACM Fellow since 2013.

25 September 2023; 10:30AM

Title: Class Expression Learning with Multiple Representations 
Speaker: Axel-Cyrille Ngonga Ngomo, Paderborn University
Abstract: RDF knowledge bases are now first-class citizens of the Web with over 100 billion RDF assertions in the 2022 WebDataCommons crawl. Developing explainable machine learning approaches tailored towards this data is hence a task of increasing importance. In this talk, we focus on class expression learning (also called concept learning) on large RDF knowledge graphs. We begin by presenting approaches based on refinement operators, the most common family of solutions for concept learning. We then continue by presenting the most important performance bottlenecks of concept learning based on refinements. We then present recent works that address each of those bottlenecks using a dedicated representation (e.g., neural, symbolic). We conclude by presenting some open challenges in concept learning and related areas.
Bio: Axel Ngonga is a professor at Paderborn University, where he heads the Data Science Group. He studied Computer Science in Leipzig. In his PhD thesis, he developed knowledge-poor methods for the extraction of taxonomies from large text corpora. After completing his PhD, he wrote a Habilitation on link discovery with a focus on machine learning and runtime optimization. In his current research, he focuses on data-driven methods to improve the lifecycle of knowledge graphs. These include techniques for the extraction of knowledge graphs, the verification of their veracity, their integration and fusion, their use in machine learning, and their exploitation in user-facing applications such as question answering systems and chatbots. He is the grateful recipient of over 25 international research prizes, including a Next Einstein Fellowship and 6 best research paper awards.

16 October 2023; 10:30AM

Title: What is Next for Data Integration? 
Speaker: AnHai Doan, University of Wisconsin
Abstract: Data powers the modern world. Consequently, there is a critical need to integrate data, so that it can be used to fuel a wide variety of applications. Yet data integration R&D, in both academia and industry, has fallen short in addressing this need. In this talk I discuss why, drawing from my experience in the past four years in industry, when I worked as a VP of Technology at Informatica, leading a team that applied machine learning and database techniques to build data integration solutions. I propose what we can do as a field going forward, and discuss the potential impacts of the latest trend, large language models, on the field.
Bio: AnHai Doan is Vilas Distinguished Achievement Professor and Gurindar S. Sohi Professor of Computer Science at the University of Wisconsin-Madison. He has worked on data integration challenges for more than 20 years, in both academia and industry. His work has been recognized by the ACM Doctoral Dissertation Award, CAREER Award, Sloan Fellowship, ACM Research Highlight, and others, and has been commercialized via several startups, impacting millions of users and thousands of enterprises. AnHai has served on the SIGMOD Advisory Committee, SIGMOD Executive Committee, and co-chaired SIGMOD-2020.

4 December 2023; 10:30AM (DC 1302 - Note the room change)

Title: Structured Knowledge and Data Management for Effective AI Systems 
Speaker: Ihab Ilyas, University of Waterloo
Abstract: Can structured data management play an important role in accelerating AI? In this talk I focus on two main aspects of structured data management and argue that they are key in powering and accelerating AI application development: 1) automating data quality and cleaning using generative models; and 2) constructing and serving structured knowledge graphs and their role in semantic annotation and grounding unstructured data.

In the first thrust, I will summarize our findings building the HoloClean project. HoloClean builds generative probabilistic models describing how data was intended to look, and uses them for predicting errors and repairs.

On the structured knowledge front, I will describe our work building Saga, an end-to-end platform for incremental and continuous construction of large-scale knowledge graphs. Saga demonstrates the complexity of building such a platform in industrial settings with strong consistency, latency, and coverage requirements. I will discuss challenges around building entity linking and fusion pipelines for constructing coherent knowledge graphs; updating the knowledge graphs with real-time streams; and finally, exposing the constructed knowledge via ML-based entity disambiguation and semantic annotation. I will also show how to query such knowledge via vector representation capable of handling hybrid similarity/filtering workloads.
Bio: Ihab Ilyas is a professor in the Cheriton School of Computer Science and the NSERC-Thomson Reuters Research Chair on data quality at the University of Waterloo. He is currently on leave as a Distinguished Engineer at Apple, where he leads the Knowledge Graph Platform team. His main research focuses on data science and data management, with special interest in data cleaning and integration, knowledge construction, and machine learning for structured data management. Ihab is a co-founder of Tamr, a startup focusing on large-scale data integration, and he is also the co-founder of inductiv (acquired by Apple), a Waterloo-based startup on using AI for structured data cleaning. He is a recipient of the Ontario Early Researcher Award, a Cheriton Faculty Fellowship, an NSERC Discovery Accelerator Award, and a Google Faculty Award, and he is an ACM Fellow and an IEEE Fellow.

11 December 2023; 10:30AM

Title: Enabling Emerging Edge and IoT Applications with Edge-Cloud Data Management (notes)
Speaker: Faisal Nawab, University of California - Irvine
Abstract: The potential of Edge and IoT applications encompasses realms like smart cities, mobility solutions, and immersive technologies. Yet, the actualization of these promising applications stumbles upon a fundamental impediment: the prevailing cloud data management technologies are often tethered to remote data centers. This architectural choice introduces daunting challenges, including substantial wide-area latency, burdensome connectivity and communication bandwidth demands, and regulatory constraints related to personal and sensitive data.

This talk presents our research in introducing edge-cloud data management that provides a framework for managing data across edge nodes to overcome the limits of cloud-only data management. We encounter various challenges to achieving this vision such as managing the sheer number of edge nodes, their sporadic availability, and device constraints in terms of compute, storage, and trust. To navigate these multifaceted challenges, our work redesigns distributed data management technologies to adapt to the edge environment. This includes introducing design concepts in the domains of hierarchical and asymmetric edge-cloud data management, decentralized edge coordination techniques, and edge-friendly mechanisms to maintain security and trust. The talk includes a demonstration of 'AnyLog', an edge-cloud data management solution that integrates our research findings.
Bio: Faisal Nawab is an assistant professor in the computer science department at the University of California, Irvine. He is the director of EdgeLab, which is dedicated to building edge-cloud data management solutions for emerging edge and IoT applications. Faisal's research is influenced by practical industry problems through his involvement with the startup 'AnyLog' where he acts as the lead architect of designing an edge-cloud database. Faisal has received recognition for his work, winning the "Next-Generation Data Infrastructure" award from Facebook, being named the runner-up for the IEEE TEMS Blockchain Early-Career Award, and being awarded several NSF grants, and industry funding from Meta and Roblox.

12 March 2024; 1:30PM (Over zoom)

Title: Utilizing fast interconnects on GPUs for data processing (notes, video)
Speaker: Tilmann Rabl, University of Potsdam & Hasso Plattner Institute
Abstract: GPUs are one of the main drivers of modern AI applications. For database processing, however, they have been mostly disregarded, because of limited memory capacity and limited interconnect bandwidth for ad hoc data transfers. Recent developments in interconnect technology, such as NVLink, enable orders of magnitude faster transfers than the long-standing PCIe 3.0 standard.
In this talk, we will give an overview of GPU topologies, modern GPU interconnects, and basic processing on GPU. Based on this, we will present how to scale Join Processing and Sorting beyond GPU memory capacity and on multiple GPUs by utilizing modern GPU interconnects. 
Bio: Tilmann Rabl received his Ph.D. from the University of Passau in 2011. After finishing his PhD thesis on the subject of scalability and data allocation in cluster databases, he continued his work as a postdoctoral researcher at the Middleware Systems Research Group at the University of Toronto. In 2015, he joined the Database Systems and Information Management group at Technische Universität Berlin as a senior researcher and visiting professor and held the position of Vice Director of the Intelligent Analytics for Massive Data group at the German Research Center for Artificial Intelligence. Since 2019, he has held the chair for Data Engineering Systems at the Digital Engineering Faculty of the University of Potsdam and the Hasso Plattner Institute. His research focuses on efficiency of database systems, real-time analytics, hardware efficient data processing, and benchmarking.

12 March 2024; 3:00PM (Over zoom)

Title: GPU Databases---The New Modality of Data Analytics (notes, video)
Speaker: Xiangyao Yu, University of Wisconsin - Madison
Abstract: The performance gap between GPUs and CPUs has been widening over the years as the hardware improves. Existing GPU databases demonstrate good performance, but suffer from limited GPU memory capacity and PCIe bandwidth, thereby failing to scale to large datasets. We conduct a series of projects to address these challenges, paving the way for wider GPU database adoption. In particular, I will present several projects: (1) an execution engine optimized for GPU architecture, (2) efficient data compression and decompression on GPUs, (3) heterogeneous CPU-GPU query processing, (4) optimized user-defined functions, and (5) multi-GPU databases. We believe GPUs can potentially become the new modality of SQL analytics in the near future.
Bio: Xiangyao Yu is an Assistant Professor at the University of Wisconsin-Madison. His research interests include (1) cloud-native databases, (2) new hardware for databases, and (3) core DB techniques in both transaction and analytical processing. Before joining UW-Madison, he completed his PhD and a postdoc at MIT. Xiangyao received the NSF CAREER Award and the Sloan Research Fellowship.

21 March 2024; 1:30PM (Over zoom)

Title: Disaggregation and Streaming: the new frontiers for data processing (notes)
Speaker: Gustavo Alonso, ETH Zürich
Abstract: Computing platforms are evolving rapidly along many dimensions but specialization, disaggregation, and streaming are at the cornerstone of these developments. These changes are being driven mostly by LLM/AI/ML applications but also arise from the need to make cloud platforms more efficient. From a practical perspective, the result we see today is a deluge of possible configurations and deployment options, most of them too new to have a precise idea of their performance implications and lacking proper support in the form of tools and platforms that can manage the underlying diversity. The growing heterogeneity is opening up many opportunities but also raising significant challenges. In the talk I will describe the trend towards specialization and disaggregation at all layers of the architecture, provide several examples from our own research, and bring up the often forgotten issue of how to program widely heterogeneous systems.
Bio: Gustavo Alonso is a professor in the Department of Computer Science of ETH Zurich where he is a member of the Systems Group. His research interests include data management, distributed systems, cloud computing architecture, and hardware acceleration through reconfigurable computing. Gustavo has received 4 Test-of-Time Awards for his research in databases, software runtimes, middleware, and mobile computing. He is an ACM Fellow, an IEEE Fellow, a Distinguished Alumnus of the Department of Computer Science of UC Santa Barbara, and has received the Lifetime Achievements Award from the European Chapter of ACM SIGOPS (EuroSys).

21 March 2024; 3:00PM (Over zoom)

Title: The Rise of ‘ScaleFlex’ – Modern Hardware Requires Modern Solutions (notes)
Speaker: Wolfgang Lehner and Alexander Krause, TU Dresden
Abstract: ScaleUp and ScaleOut architectures have guided the design and development of scalable database engines in the last decade. However, upcoming hardware developments are promoting highly disaggregated computing platforms, which allow a “best of breed” approach and require a systematic separation of compute and storage. Within this presentation, we will outline the potential and challenges of disaggregated memory infrastructures for database systems. We will also report on recent research contributions that strive to extend compile and runtime components of database systems in order to cope with specialized hardware components such as FPGAs, thus pushing the envelope of highly efficient next-generation database engines.
Bio: Wolfgang Lehner is Director of the Institute of Systems Architecture and head of the Database Research Group at TU Dresden, Germany. His research focuses on database system architectures specifically looking at crosscutting aspects from data engineering algorithms and efficient data structures up to hardware-related aspects. He is internationally visible within the research community with presentations, tutorials, and demos and maintains a strong network of academic and industrial partners. Wolfgang Lehner serves the international data management community in many roles, being Co-PC-chair of many conferences and workshops, member of editorial boards and reviewing committees, and acts as the Managing Editor of “Proceedings of the VLDB Endowment” (PVLDB). He also serves on the grants committee of collaborative research centers within the German Research Foundation (DFG). He is an appointed member of the German Science and Humanities Council and of the Academy of Europe.

Alexander Krause is a PostDoc at TU Dresden’s Database Research Group, chaired by Wolfgang Lehner. His dissertation on “Graph Pattern Matching on Symmetric Multiprocessor Systems” focused on energy-efficient and adaptive graph processing. Alexander is an appointed member of the SIGMOD Availability and Reproducibility Committee (ARC). Previously, he served as the Proceedings Chair for BTW 2021 and 2023 in Dresden. His research mainly focuses on the design of scalable data systems in the context of disaggregated systems by leveraging RDMA and CXL.

28 March 2024; 3:00PM (Over zoom)

Title: A Remote Dynamic Memory Cache Using Spot VMs (notes, video)
Speaker: Philip Bernstein, Microsoft Research
Abstract: Data management systems are hungry for main memory, and cloud data centers are awash in it. But that memory is not always easily accessible and often too expensive. To bridge this gap, we propose a new cloud service that allows a data-intensive system to opportunistically offload its in-memory data, and computation over that data, to a remote cache. Each cache is a byte-array hosted by multiple VMs – spot VMs when possible or statically provisioned VMs when not. We built two prototypes of the service: Redy and CompuCache.
  • Redy uses RDMA for fast reads and writes on the remote cache. It automatically customizes the resource configuration for the given SLO. Its performance is significantly better than server-local SSD.
  • CompuCache uses eRPC to execute stored procedures on the cache server, distributing each stored procedure execution across the instances. It executes 126 million stored procedure invocations per second on one VM with 16 threads.
Both prototypes handle dynamic reassignment of remote memory regions and recovery from failures. This is joint work with Qizhen Zhang, published at VLDB 2022 and CIDR 2022.
Bio: Philip A. Bernstein is a Distinguished Scientist at Microsoft Research and an Affiliate Professor at University of Washington. He was previously a product architect at Microsoft and Digital Equipment Corp., a professor at Harvard University and Wang Institute of Graduate Studies, and a VP Software at Sequoia Systems. He has published over 150 papers and two books on the theory and implementation of database systems, especially on transaction processing and data integration, and has contributed to a variety of database products. He is a Fellow of the ACM and AAAS, a winner of the E.F. Codd SIGMOD Innovations Award, and a member of the Washington State Academy of Sciences and the U.S. National Academy of Engineering. He received a B.S. degree from Cornell and M.Sc. and Ph.D. from University of Toronto.

29 April 2024; 10:30AM

Title: ASTral: Fast Rewriting for Query Optimization 
Speaker: Darshana Balakrishnan, University at Buffalo
Abstract: A compiler's optimizer operates over abstract syntax trees (ASTs), continuously applying rewrite rules to replace subtrees of the AST with more efficient ones. Especially on large source repositories, even simply finding opportunities for a rewrite can be expensive, as the optimizer traverses the AST naively. Moreover, some of the search tasks may be repeated across rewrites, which makes the search a redundant effort. In this talk, we look at two orthogonal approaches and explore options for making the search faster through indexing, incremental view maintenance (IVM) and state machines.
Bio: Darshana Balakrishnan is a Software Developer at Amazon Web Services with the Redshift team in Toronto and a final-semester PhD candidate at the University at Buffalo. Her prior work on incremental and declarative compilers, titled Tree Toaster and Fluid Data Structures, has been published at SIGMOD and DBPL.

27 May 2024; 10:30AM

Title: Efficient Distributed Complex Event Processing (video)
Speaker: Matthias Weidlich, Humboldt University
Abstract: Complex event processing emerged as a computational paradigm to detect patterns in event streams based on the continuous evaluation of event queries. Once such queries are evaluated in a network of event sources, efficient query evaluation may be achieved through the distributed evaluation of queries. In this talk, we present some of our recent results on achieving such distribution with graph-based evaluation plans as well as optimizations that rely on push-pull-communication.
Bio: Matthias Weidlich is a full professor at the Department of Computer Science at Humboldt-Universität zu Berlin (HU Berlin), Germany, where he holds the Chair on Databases and Information Systems. Before joining HU Berlin, he held positions at Imperial College London and at the Technion - Israel Institute of Technology. He has a PhD in Computer Science from the Hasso-Plattner-Institute, University of Potsdam. His research focuses on data-driven process analysis, event stream processing, and exploratory data analysis. He serves as Co-Editor in Chief for the Information Systems journal and is a member of the steering committees of the ACM DEBS and BPM conference series.

5 June 2024; 10:30AM (Note the unusual day)

Title: Opportunities for Latency Hiding in Modern OLTP Engines (video)
Speaker: Tianzheng Wang, Simon Fraser University
Abstract: Traditional OLTP engines are limited by various latencies, such as I/O, memory stalls, synchronization and scheduling. Hiding such latency has been a major goal to achieve high transaction processing performance, but prior efforts have seen limited adoption by missing joint optimizations that mitigate the impact of multiple latency sources. A prime example is software prefetching, which interleaves memory access and compute but is often at odds with asynchronous I/O. In this talk, we will revisit various sources of latency in memory-optimized, larger-than-memory OLTP engines and propose several solutions that allow joint optimizations. We show how I/O and other sources of latency can be hidden to achieve high throughput, without cancelling out the benefits of software prefetching. The gist is a simple but effective scheduling scheme based on lightweight asynchronous I/O and stackless coroutines (e.g., those in C++20). We also emphasize the effort to make these work in an end-to-end database engine and considerations beyond performance, such as programmability and backward compatibility.
Bio: Tianzheng Wang is an assistant professor in the School of Computing Science at Simon Fraser University (SFU) in Metro Vancouver, Canada. His research centres around the making of database systems in the context of modern hardware, new programming language features and primitives, and new applications. His work also often extends to related areas such as operating systems, parallel programming and distributed systems. Tianzheng Wang received his Ph.D. and M.Sc. degrees in Computer Science from the University of Toronto in 2017 and 2014, respectively. He received his B.Sc. in Computing degree (First Class Honours) from Hong Kong Polytechnic University in 2012. Prior to joining SFU, he spent one year (2017-2018) at Huawei Canada Research Centre (Toronto) as a research engineer. In addition to adoptions by major cloud vendors and startups, his work has been recognized by two ACM SIGMOD Research Highlight Awards (2020 and 2022), a 2019 IEEE TCSC Award for Excellence in Scalable Computing (Early Career Researchers) and nominations for best/memorable paper awards.