Data Systems Seminar Series (2023-2024)

The Data Systems Seminar Series provides a forum for presentation and discussion of interesting and current data systems issues. It complements our internal data systems meetings by bringing in external colleagues. The talks that are scheduled for this year are listed below.

The talks are usually held on a Monday at 10:30AM in room DC 1304. Exceptions are flagged. Due to Covid-19, some talks might be held virtually over Zoom; these will be identified.

The talks are open to the public.

We will post the presentation videos whenever possible. Past DSG Seminar videos can be found on the DSG YouTube channel.

The Data Systems Seminar Series is supported by

Reynold Cheng
Eric Lo
Patrick Valduriez
Axel Ngonga
AnHai Doan
Ihab Ilyas
Faisal Nawab

22 August 2023; 10:30AM

Title: HINCare: Data-Driven Volunteering
Speaker: Reynold C.K. Cheng, University of Hong Kong
Abstract: In Hong Kong, the number of elderly citizens is estimated to rise to one third of the population, or 2.37 million, by 2037. As they age and become more frail, the demand for formal support services (e.g., domestic or escort services) will increase significantly in the coming years. However, there is a severe lack of manpower to meet these needs. Some elderly-care homes have reported a 70% shortage of employees. There is thus a strong need for voluntary or part-time helpers to take care of elders.

In this talk, I will introduce HINCare, a software platform that encourages mutual-help and volunteering culture in the community. HINCare uses the HIN (Heterogeneous Information Network) to recommend helpers to elders or other service recipients. The algorithms that use HINs and AI technologies for matching elders and helpers are based on our recent research results. This is the first time that HIN is used to support elderly care.

HINCare is now available for download in the Apple App Store and Google Play Store, and has been serving more than a thousand elders and helpers in NGOs (e.g., SKH and CSFC). The app was originally designed for elderly users, but has now expanded its services to support the Community Investment and Inclusion Fund (CIIF) and 14 NGOs engaged in teenage and family services. The system won the HKICT Award 2021, two Asia Smart App Awards (2021 and 2020), and the HKU Faculty Knowledge Exchange Award 2021.

Bio: Reynold Cheng is a Professor in the Department of Computer Science at the University of Hong Kong (HKU). His research interests are in data science, big graph analytics and uncertain data management. He was an Assistant Professor in the Department of Computing of the Hong Kong Polytechnic University (HKPU) from 2005 to 2008. He received his BEng (Computer Engineering) in 1998 and his MPhil (Computer Science and Information Systems) in 2000 from HKU. He then obtained his MSc and PhD degrees from the Department of Computer Science of Purdue University in 2003 and 2005.

Reynold was listed among the World's Top 2% Scientists by Stanford University in 2022, and was named a 2023 AI 2000 Most Influential Scholar Honorable Mention in Database. He received the SIGMOD Research Highlight Award 2020, the HKICT Award 2021, and the HKU Knowledge Exchange Award (Engineering) 2021. He was granted an Outstanding Young Researcher Award 2011-12 by HKU. He received the Universitas 21 Fellowship in 2011, and two Performance Awards from HKPU Computing in 2006 and 2007. He is an academic advisor to the College of Professional and Continuing Education of HKPU. He is a member of IEEE, ACM, ACM SIGMOD, and UPE. He was a PC co-chair of IEEE ICDE 2021, and has been serving on the program committees and review panels for leading database conferences and journals such as SIGMOD, VLDB, ICDE, KDD, IJCAI, AAAI, and TODS. He is on the editorial board of KAIS, IS and DAPD, and is a former editorial board member of TKDE.

24 August 2023; 11:00AM (Note the later start time)

Title: When Private Blockchain Meets Deterministic Database
Speaker: Eric Lo, Chinese University of Hong Kong
Abstract: Private blockchain as a replicated transactional system shares many commonalities with a distributed database. However, the intimacy between a private blockchain and a deterministic database has never been studied. In essence, both private blockchains and deterministic databases ensure replica consistency by determinism. While private blockchains have started to pursue deterministic transaction executions recently, deterministic databases have already studied deterministic concurrency control protocols for almost a decade. In this talk, I will present Harmony, a novel deterministic concurrency control protocol for blockchain use. We use Harmony to build a new relational blockchain, namely HarmonyBC, which features low abort rates, hotspot resiliency, and inter-block parallelism, all of which are especially important to disk-oriented blockchains.
Bio: Eric Lo is an Associate Professor of Computer Science and Engineering at the Chinese University of Hong Kong (CUHK). He holds an M.Phil degree from the University of Hong Kong, which he earned in 2005, and a PhD degree from ETH Zurich, which he obtained in 2007. Prior to joining CUHK, he worked at both Google and Microsoft. His recent research is centered around AI systems and geo-distributed databases. He has served on the program committee of all major data engineering conferences and currently holds the position of Associate Editor at PVLDB. His research has been recognized with best paper awards and best paper honorable mentions at conferences such as VLDB'05 and ICDE'12. Recently, his work received the ACM SIGMOD Research Highlight Award 2020.

5 September 2023; 10:30AM

Title: Life Science Workflow Services (LifeSWS): Motivations and Architecture
Speaker: Patrick Valduriez, Inria, University of Montpellier, CNRS, LIRMM
Abstract: Data-driven science requires manipulating large datasets coming from various data sources through complex workflows based on a variety of models and languages. With the increasing number of big data sources and models developed by different groups, it is hard to relate models and data and use them in unanticipated ways for specific data analysis. Current solutions are typically ad hoc, specialized for particular data, models and workflow systems. In this talk, we focus on data-driven life science and propose an open service-based architecture, Life Science Workflow Services (LifeSWS), which provides data analysis workflow services for life sciences. We illustrate our motivations and rationale for the architecture with real use cases from life science.
Bio: Patrick Valduriez is a director of research emeritus at Inria, France, the scientific director of the Inria-Brasil international lab, and the Chief Scientist Officer of the LeanXcale company (which delivers a NewSQL database).

He is currently a member of the Zenith team (between Inria and University of Montpellier at the LIRMM lab.), focusing on data science, in particular data management in large-scale distributed and parallel systems and scientific data management. He has authored and co-authored more than 400 technical papers and several textbooks, among which “Principles of Distributed Database Systems” (with Professor Tamer Özsu, University of Waterloo). He currently serves as associate editor of the Distributed and Parallel Databases journal. He has served as PC chair of major conferences such as SIGMOD and VLDB. He was the general chair of SIGMOD 2004, EDBT 2008 and VLDB 2009.

He has received several best paper awards, including at VLDB 2000. He was the recipient of the 1993 IBM scientific prize in Computer Science in France and the 2014 Innovation Award from Inria and the French Academy of Science. He has been an ACM Fellow since 2013.

25 September 2023; 10:30AM

Title: Class Expression Learning with Multiple Representations 
Speaker: Axel-Cyrille Ngonga Ngomo, Paderborn University
Abstract: RDF knowledge bases are now first-class citizens of the Web with over 100 billion RDF assertions in the 2022 WebDataCommons crawl. Developing explainable machine learning approaches tailored towards this data is hence a task of increasing importance. In this talk, we focus on class expression learning (also called concept learning) on large RDF knowledge graphs. We begin by presenting approaches based on refinement operators, the most common family of solutions for concept learning. We then continue by presenting the most important performance bottlenecks of concept learning based on refinements. We then present recent works that address each of those bottlenecks using a dedicated representation (e.g., neural, symbolic). We conclude by presenting some open challenges in concept learning and related areas.
Bio: Axel Ngonga is a professor at Paderborn University, where he heads the Data Science Group. He studied Computer Science in Leipzig. In his PhD thesis, he developed knowledge-poor methods for the extraction of taxonomies from large text corpora. After completing his PhD, he wrote a Habilitation on link discovery with a focus on machine learning and runtime optimization. In his current research, he focuses on data-driven methods to improve the lifecycle of knowledge graphs. These include techniques for the extraction of knowledge graphs, the verification of their veracity, their integration and fusion, their use in machine learning, and their exploitation in user-facing applications such as question answering systems and chatbots. He is the grateful recipient of over 25 international research prizes, including a Next Einstein Fellowship and 6 best research paper awards.

16 October 2023; 10:30AM

Title: What is Next for Data Integration? 
Speaker: AnHai Doan, University of Wisconsin
Abstract: Data powers the modern world. Consequently, there is a critical need to integrate data, so that it can be used to fuel a wide variety of applications. Yet data integration R&D, in both academia and industry, has fallen short in addressing this need. In this talk I discuss why, drawing from my experience in the past four years in industry, when I worked as a VP of Technology at Informatica, leading a team that applied machine learning and database techniques to build data integration solutions. I propose what we can do as a field going forward, and discuss the potential impacts of the latest trend, large language models, on the field.
Bio: AnHai Doan is Vilas Distinguished Achievement Professor and Gurindar S. Sohi Professor of Computer Science at the University of Wisconsin-Madison. He has worked on data integration challenges for more than 20 years, in both academia and industry. His work has been recognized by the ACM Doctoral Dissertation Award, CAREER Award, Sloan Fellowship, ACM Research Highlight, and others, and has been commercialized via several startups, impacting millions of users and thousands of enterprises. AnHai has served on the SIGMOD Advisory Committee, SIGMOD Executive Committee, and co-chaired SIGMOD-2020.

4 December 2023; 10:30AM (DC 1302 - Note the room change)

Title: Structured Knowledge and Data Management for Effective AI Systems 
Speaker: Ihab Ilyas, University of Waterloo
Abstract: Can structured data management play an important role in accelerating AI? In this talk I focus on two main aspects of structured data management and argue that they are key in powering and accelerating AI application development: 1) automating data quality and cleaning using generative models; and 2) constructing and serving structured knowledge graphs and their role in semantic annotation and grounding unstructured data.

In the first thrust, I will summarize our findings building the HoloClean project. HoloClean builds generative probabilistic models describing how data was intended to look, and uses them for predicting errors and repairs.

On the structured knowledge front, I will describe our work building Saga, an end-to-end platform for incremental and continuous construction of large-scale knowledge graphs. Saga demonstrates the complexity of building such a platform in industrial settings with strong consistency, latency, and coverage requirements. I will discuss challenges around building entity linking and fusion pipelines for constructing coherent knowledge graphs; updating the knowledge graphs with real-time streams; and finally, exposing the constructed knowledge via ML-based entity disambiguation and semantic annotation. I will also show how to query such knowledge via vector representation capable of handling hybrid similarity/filtering workloads.
Bio: Ihab Ilyas is a professor in the Cheriton School of Computer Science and the NSERC-Thomson Reuters Research Chair on data quality at the University of Waterloo. He is currently on leave as a Distinguished Engineer at Apple, where he leads the Knowledge Graph Platform team. His main research focuses on data science and data management, with special interest in data cleaning and integration, knowledge construction, and machine learning for structured data management. Ihab is a co-founder of Tamr, a startup focusing on large-scale data integration, and he is also the co-founder of inductiv (acquired by Apple), a Waterloo-based startup on using AI for structured data cleaning. He is a recipient of the Ontario Early Researcher Award, a Cheriton Faculty Fellowship, an NSERC Discovery Accelerator Award, and a Google Faculty Award, and he is an ACM Fellow and an IEEE Fellow.

11 December 2023; 10:30AM

Title: Enabling Emerging Edge and IoT Applications with Edge-Cloud Data Management 
Speaker: Faisal Nawab, University of California - Irvine
Abstract: The potential of Edge and IoT applications encompasses realms like smart cities, mobility solutions, and immersive technologies. Yet, the actualization of these promising applications stumbles upon a fundamental impediment: the prevailing cloud data management technologies are often tethered to remote data centers. This architectural choice introduces daunting challenges, including substantial wide-area latency, burdensome connectivity and communication bandwidth demands, and regulatory constraints related to personal and sensitive data.

This talk presents our research in introducing edge-cloud data management that provides a framework for managing data across edge nodes to overcome the limits of cloud-only data management. We encounter various challenges to achieving this vision, such as managing the sheer number of edge nodes, their sporadic availability, and device constraints in terms of compute, storage, and trust. To navigate these multifaceted challenges, our work redesigns distributed data management technologies to adapt to the edge environment. This includes introducing design concepts in the domains of hierarchical and asymmetric edge-cloud data management, decentralized edge coordination techniques, and edge-friendly mechanisms to maintain security and trust. The talk includes a demonstration of 'AnyLog', an edge-cloud data management solution that integrates our research findings.
Bio: Faisal Nawab is an assistant professor in the computer science department at the University of California, Irvine. He is the director of EdgeLab, which is dedicated to building edge-cloud data management solutions for emerging edge and IoT applications. Faisal's research is influenced by practical industry problems through his involvement with the startup 'AnyLog', where he acts as the lead architect designing an edge-cloud database. Faisal has received recognition for his work, winning the "Next-Generation Data Infrastructure" award from Facebook, being named the runner-up for the IEEE TEMS Blockchain Early-Career Award, and being awarded several NSF grants and industry funding from Meta and Roblox.

15 January 2024; 10:30AM

Title: TBD
Speaker: TBD
Abstract: TBD
Bio: TBD

29 April 2024; 10:30AM

Title: TBD
Speaker: TBD
Abstract: TBD
Bio: TBD

27 May 2024; 10:30AM

Title: TBD
Speaker: TBD
Abstract: TBD
Bio: TBD

24 June 2024; 10:30AM

Title: TBD
Speaker: TBD
Abstract: TBD
Bio: TBD