The Data Systems Seminar Series provides a forum for exploring and discussing important topics in data systems, from current challenges to emerging trends. It complements our internal meetings by bringing fresh perspectives from invited external speakers.
The schedule for the 2025–26 academic year is outlined below and will be updated as speakers are confirmed.
Seminars are typically held monthly on Mondays at 10:30 a.m. in DC 1302, unless otherwise noted. Some sessions may be held virtually on Zoom; these will be clearly marked.
All talks are open to the public.
We will record and upload videos of presentations whenever possible. Past DSG Seminar Series videos are on the DSG YouTube channel.
The Data Systems Seminar Series is supported by

| Hazar Harmouch |
| Zhuoyue Zhao |
| Tianzheng Wang |
| Mostafa Milani |
| Boris Glavic |
| Stratos Idreos |
Monday, September 22, 2025 at 10:30 a.m.
| Title | Beyond Accuracy: Data Quality as the Backbone of Trustworthy AI (PowerPoint presentation, 11.8 MB PDF) |
| Speaker | Hazar Harmouch, Assistant Professor, Intelligent Data Engineering Lab, University of Amsterdam |
| Abstract | In the era of artificial intelligence, the quality of data has become a central determinant of system reliability, fairness, and trust. While advances in AI promise transformative applications across domains, the benefits are critically dependent on the quality of the underlying data. This talk explores how data quality shapes AI systems from multiple perspectives: performance, fairness, and compliance. I will discuss our work on assessing and improving data quality in machine learning pipelines, including step-by-step cleaning recommendations, quantifying diversity in datasets, and benchmarks for AI robustness under labeling noise and other data quality issues. I will also highlight our contributions to bridging the gap between technical quality assessment and human perspectives, including the development of a cross-disciplinary data quality glossary and surveying practitioners in light of the AI Act. Together, these insights point toward a more holistic view of data quality, one that incorporates not only statistical measures but also ethical, legal, and user-centered dimensions to build AI systems that are both effective and trustworthy. |
| Bio | Professor Harmouch is a member of the Intelligent Data Engineering Lab at the Informatics Institute of the University of Amsterdam, Netherlands. Her research focuses on data quality for and with machine learning. Before that, she was a postdoc in the Information Systems Group at the Hasso Plattner Institute, University of Potsdam. During her doctoral research in the same group, she worked on data profiling, with the aim of developing algorithms for efficiently processing and analyzing large volumes of data. Beyond research, Professor Harmouch is also interested in keeping up with the latest developments in machine learning, data management, and integration. She is also constantly looking for new collaboration opportunities! Outside of work, she enjoys travelling, reading, cooking and training. |
Wednesday, December 10, 2025 at 10:30 a.m. (Note the atypical day)
| Title | Enabling Fast and Correct Approximate Query Processing in HTAP Systems |
| Speaker | Zhuoyue Zhao, Department of Computer Science and Engineering, University at Buffalo |
| Abstract | Approximate Query Processing (AQP) enables users to trade a slight loss of accuracy for very low query latencies. For today’s Hybrid Transactional/Analytical Processing (HTAP) workloads, this could be very useful for replacing some of the expensive analytical queries when approximation is acceptable. However, traditional AQP systems rely on scan-based random sampling and thus still incur high latencies. Meanwhile, many AQP algorithms rely on specialized sampling indexes to perform random sampling without excessive scanning, but they are often not concurrency-safe or updatable. In this talk, I will present our recent work on a fast, concurrency-safe, and updatable sampling index for independent range sampling. It can sustain high rates of ingestion and sampling under snapshot isolation. It is fully integrated into PostgreSQL, and we also built a new AQP extension around it. I will also discuss several challenges and promising directions for AQP in modern in-memory HTAP systems. |
| Bio | Zhuoyue Zhao is currently an assistant professor at the University at Buffalo. He holds a PhD degree from the University of Utah, where he was advised by Prof. Feifei Li and Prof. Jeff Phillips. His research interests are in database systems, specifically query processing and optimization, transaction processing, and storage and indexing. He received an NSF CAREER award in 2024, and two SIGMOD best paper awards in 2016 and 2025. |
Wednesday, January 28, 2026 at 10:30 a.m. in DC 1304 (Note the atypical day and location)
| Title | Taming Latency in Modern Database Engines |
| Speaker | Tianzheng Wang, Associate Professor, School of Computing Science, Simon Fraser University |
| Abstract | Everything takes time in a database engine: I/O, memory stalls, synchronization, and admission control all add delays on top of running transaction logic. Reducing and hiding such latency has been a major goal in achieving high transaction and query performance, but prior efforts have seen limited adoption because they lack joint optimizations that mitigate the impact of multiple latency sources. A prime example is software prefetching, which interleaves memory access and compute but is often at odds with asynchronous I/O. In this talk, we describe our recent efforts on reducing and hiding latencies by judiciously leveraging both hardware and software primitives such as prefetching instructions, asynchronous I/O and recent userspace interrupts. We also emphasize the effort to make these work in an end-to-end database engine and considerations beyond performance, such as programmability and backward compatibility. |
| Bio | Tianzheng Wang is an associate professor in the School of Computing Science at Simon Fraser University in Metro Vancouver, Canada. His research centres around the making of database systems in the context of modern hardware, new programming primitives, and new applications. His work also often extends to related areas such as operating systems, parallel programming and distributed systems. Tianzheng Wang received his Ph.D. (2017) and M.Sc. (2014) degrees in Computer Science from the University of Toronto, and his B.Sc. in Computing (2012, First Class Honours) from Hong Kong Polytechnic University. His work has been assimilated by cloud vendors and startups, and recognized by awards such as the ACM SIGMOD Best Paper Award (2025), ACM SIGMOD Research Highlight Awards (2021 and 2023), and the 2019 IEEE TCSC Award for Excellence in Scalable Computing (Early Career Researchers). |
Monday, March 16, 2026 at 10:30 a.m. in DC 1302
| Title | Eliminating Spurious Dependencies in Data: From Cleaning to Private Data Generation |
| Speaker | Mostafa Milani, Assistant Professor, Department of Computer Science, Western University |
| Abstract | Statistical dependencies embedded in data can reflect historical bias, measurement errors, or confounding effects. When such dependencies link sensitive attributes to outcomes in unintended ways, machine learning models trained on the data may inherit unfair or unstable behavior. While many fairness interventions operate at the model level, they leave the underlying data unchanged. This talk presents a data-centric approach that addresses unwanted dependencies directly at the level of data processing by enforcing conditional independence (CI) constraints. Two complementary settings are considered. First, a probabilistic data cleaning framework is introduced that corrects datasets violating CI constraints by learning an optimal transport map. This map modifies the empirical data distribution as little as possible while removing specified conditional dependencies. Second, a method is presented for enforcing CI during differentially private synthetic data generation by constraining the structure learning stage of private graphical models. This prevents the synthetic data from encoding prohibited dependency paths, while preserving both privacy guarantees and predictive utility. Together, these works demonstrate how fairness constraints can be formulated as structural constraints on statistical dependencies, and how they can be enforced both in observed data and in privacy-preserving data release. |
| Bio | Mostafa Milani is an Assistant Professor in the Department of Computer Science at Western University. His research focuses on data quality, data cleaning, and trustworthy data systems, with an emphasis on fairness and privacy in structured data. He previously completed postdoctoral research at the University of British Columbia and McMaster University. He has served on the program committees of leading conferences, including SIGMOD, VLDB, ICDE, and FAccT, and has taken on organizational roles such as Communications Chair at SIGMOD and Registration Chair at ICDE. |
Wednesday, March 18, 2026 at 10:15 a.m. in DC 3301 | DSG Lab (Note the atypical time and location)
| Title | Efficient Query Processing and Learning On Dirty Data |
| Speaker | Boris Glavic, Associate Professor, Department of Computer Science, University of Illinois at Chicago |
| Abstract | Data quality issues such as missing values, constraint violations, and outliers are prevalent in most real-world datasets. While the database community has developed a rich toolbox for addressing such data errors, the dominant practice is still to select a single “best-guess” repair that is then treated as gospel. However, the ground truth clean version of a dirty dataset is often unavailable, expensive to collect, or fundamentally non-identifiable from available observations. Seemingly reasonable cleaning choices embed hard-to-validate assumptions that, if violated, can lead to erroneous and misleading analysis outcomes. As the ground truth clean version of a dirty dataset is typically not recoverable, to trust any result of a computation over dirty data, it is necessary to reason about all possible repairs and derive sound bounds on the possible outcomes of the computation to certify its robustness or demonstrate that it is fundamentally too uncertain to be trusted. Unfortunately, this is computationally hard, even for relatively simple classes of computations and limited types of data errors. In this talk, I will provide an overview of our work on lightweight models for uncertain data that enable the compact representation of the set of all feasible repairs for a dirty dataset for a wide range of data quality issues. Our work is the first to provide efficient support for evaluating complex relational queries as well as machine learning training and inference over uncertain data. |
| Bio | Boris Glavic is an Associate Professor in the Department of Computer Science at the University of Illinois at Chicago, leading UIC’s DBGroup. His research spans several areas of database systems and data science, including data provenance, query execution and optimization, uncertain data, systems for ML, and data integration and cleaning. |
Wednesday, March 25, 2026 at 10:30 a.m.
| Title | to come |
| Speaker | Stratos Idreos, Professor, Harvard’s John A. Paulson School of Engineering and Applied Sciences; Faculty Co-Director, Harvard Data Science Initiative |
| Abstract | |
| Bio | |
Monday, May 11, 2026 at 10:30 a.m.
| Title | |
| Speaker | |
| Abstract | |
| Bio | |
Monday, June 15, 2026 at 10:30 a.m.
| Title | |
| Speaker | |
| Abstract | |
| Bio | |
Monday, July 20, 2026 at 10:30 a.m.
| Title | |
| Speaker | |
| Abstract | |
| Bio | |
Monday, August 17, 2026 at 10:30 a.m.
| Title | |
| Speaker | |
| Abstract | |
| Bio | |