Data Systems Seminar Series (2022-2023)

The Data Systems Seminar Series provides a forum for presentation and discussion of interesting and current data systems issues. It complements our internal data systems meetings by bringing in external colleagues. The talks that are scheduled for this year are listed below.

The talks are usually held on a Monday at 10:30 am in room DC 1302. Exceptions are flagged. Due to Covid-19, some talks might be held virtually over Zoom; these will be identified.

The talks are open to the public.

We will post the presentation videos whenever possible. Past DSG Seminar videos can be found on the DSG YouTube channel.

The Data Systems Seminar Series is supported by

Hannah Bast
Vasiliki Kalavri
Essam Mansour
Yang Cao
Justin Zobel
Jelle Hellings
Jian Pei
Arijit Khan
Oana Balmau

19 September 2022; 10:30AM (The seminar will be online; please use this link to join)

Title: The QLever SPARQL engine (and: how (not) to get practical work published) video
Speaker: Hannah Bast, Albert-Ludwigs-Universität Freiburg
Abstract: QLever is a new SPARQL engine, which can search very large knowledge graphs (100 billion triples and more) efficiently with very moderate resources (a standard PC is enough). QLever features live autocompletion, a text search component, and support for difficult geographic queries and the interactive visualization of their results. Building such an engine from the ground up is a lot of work, but also very rewarding. I will give you a guided tour with many demos and various glimpses under the hood, with examples of clever algorithms, algorithm engineering, and modern C++. I will also talk about the meta topics of reproducibility and how (not) to get work of this kind published.
Bio: Hannah Bast started programming early in her life and loved mathematics, so she studied mathematics and computer science and stumbled into a career in theoretical computer science. Reviewers liked her theoretical work, but she wasn't too happy doing art only for art's sake. She then regressed to more practical work again, which she likes a lot, but reviewers not so much. Since 2009, she has been a full professor at the University of Freiburg. She was a dean, a member of the AI commission of the German parliament, and worked at Google, creating a new route planner for Google Maps. She enjoys life despite its absurdity.

17 October 2022; 10:30AM

Title: Efficient collaborative analytics with no information leakage: An idea whose time has come video
Speaker: Vasiliki Kalavri, Boston University
Abstract: Enabling secure outsourced analytics with practical performance has been a long-standing research challenge in the database community. In this talk, I will present our work towards realizing this vision with Secrecy, a new framework for secure relational analytics in untrusted clouds. Secrecy targets offline collaborative analytics, where data owners (hospitals, companies, research institutions, or individuals) are willing to allow certain computations on their collective private data, provided that data remain siloed from untrusted entities. To ensure no information leakage and provable security guarantees, Secrecy relies on cryptographically secure Multi-Party Computation (MPC). Instead of treating MPC as a black box, like prior works, Secrecy exposes the costs of oblivious queries to the planner and employs novel logical, physical, and protocol-specific optimizations, all of which are applicable even when data owners do not participate in the computation. As a result, Secrecy outperforms state-of-the-art systems and can comfortably process much larger datasets with good performance and modest use of resources.
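To give a concrete feel for the MPC building blocks such systems rely on, here is a minimal sketch of additive secret sharing in Python. This is an illustration of the general idea only, not Secrecy's actual protocol; all names and parameters below are mine.

```python
import random

PRIME = 2**61 - 1  # field modulus; all arithmetic is mod PRIME

def share(secret, n_parties=3):
    """Split a secret into n additive shares that sum to it mod PRIME.
    Any subset of fewer than n shares reveals nothing about the secret."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    """Recover the secret by summing all shares mod PRIME."""
    return sum(shares) % PRIME

# Each party can add its shares of two secrets locally; the local sums
# reconstruct to the sum of the secrets, with no party seeing either value.
a, b = 42, 100
sa, sb = share(a), share(b)
local_sums = [(x + y) % PRIME for x, y in zip(sa, sb)]
assert reconstruct(sa) == a
assert reconstruct(local_sums) == (a + b) % PRIME
```

This additive homomorphism is what lets oblivious query operators compute on siloed data without decrypting it; the hard (and costly) part, which Secrecy optimizes, is everything beyond addition.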
Bio: Vasiliki (Vasia) Kalavri is an Assistant Professor of Computer Science at Boston University, where she leads the Complex Analytics and Scalable Processing (CASP) Systems lab. Vasia and her team enjoy doing research on multiple aspects of (distributed) data-centric systems. Recently, they have been working on self-managed systems for data stream processing, systems for scalable graph ML, and MPC systems for private collaborative analytics. Before joining BU, Vasia was a postdoctoral fellow at ETH Zurich and received a joint PhD from KTH (Sweden) and UCLouvain (Belgium).

21 November 2022; 10:30AM; DC 1304 (Note the different room)


Title: Linked Data Science: Systems and Applications video
Speaker: Essam Mansour, Concordia University
Abstract: In recent years, we have witnessed a growing interest in data science not only from academia but particularly from companies investing in data science platforms to analyze large amounts of data. In this process, a myriad of data science artifacts, such as datasets and pipeline scripts, are created. Yet, there has so far been no systematic attempt to holistically exploit the collected knowledge and experiences that are implicitly contained in the specification of these pipelines, e.g., compatible datasets, cleansing steps, ML algorithms, parameters, etc.

Instead, data scientists still spend a considerable amount of their time trying to recover relevant information and experiences from colleagues, trial and error, lengthy exploration, etc. In this talk, we therefore propose a novel system (KGLiDS) that employs machine learning to extract the semantics of data science pipelines and capture them in a knowledge graph, which can then be exploited to assist data scientists in various ways. This abstraction is the key to enabling Linked Data Science, since it allows us to share the essence of pipelines between platforms, companies, and institutions without revealing critical internal information, instead focusing on the semantics of what is being processed and how. Our comprehensive evaluation uses thousands of datasets and more than thirteen thousand pipeline scripts extracted from data discovery benchmarks and the Kaggle portal, and shows that KGLiDS significantly outperforms state-of-the-art systems on related tasks, such as dataset recommendation and pipeline classification.

Bio: Dr. Essam Mansour has been an assistant professor since 2019 in the Department of Computer Science and Software Engineering (CSSE) at Concordia University in Montreal, where he heads the Cognitive Data Science Lab (CoDS). His research program focuses on developing cognitive data science platforms for federated and big datasets. His research interests are in the broad areas of parallel/distributed systems, data management, knowledge graphs, and graph neural networks. Essam has spent more than 10 years doing world-class research in the areas of databases, parallel/distributed systems, big data analytics, and querying geo-distributed graphs. He develops and optimizes big data systems to work at scale on supercomputers and cloud resources. His research contributions have led to more than 30 conference and journal papers, mostly in top-tier venues such as VLDBJ, PVLDB, SIGMOD, ICDE, EDBT, and CIKM. He has been invited as a reviewer for top journals, such as ACM Transactions on Database Systems (TODS), the VLDB Journal, and IEEE Transactions on Knowledge and Data Engineering (TKDE). Essam has also served as a program committee member in several top conferences, such as VLDB 2016 to 2023, SIGMOD 2023, and ICDE 2016.

16 January 2023; 10:00AM (Note the different time - The seminar will be online; please register to join)


Title: Towards Differentially Private Federated Learning with Untrusted Servers video

Speaker: Yang Cao, Hokkaido University
Abstract: Federated learning has received increasing attention in academia and industry as a new privacy-preserving machine learning paradigm. Unlike traditional machine learning, which requires data collection before training, in federated learning, the clients collaboratively train a model under the coordination of a central server. In particular, the clients only share model updates to the server, and all raw data are stored locally. However, recent studies showed that the model updates might reveal sensitive information to the server. In addition, federated learning itself does not guarantee formal privacy. This talk will review recent advances on differentially private federated learning under untrusted servers, introduce our attempts towards this goal by leveraging LDP, the shuffle model of DP and TEE, and discuss some open problems.
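As a rough illustration of one mechanism in this space, here is a minimal sketch of federated averaging with clipped, Gaussian-noised client updates. The function and parameter names are mine, not from the speaker's systems, and the LDP, shuffle-model, and TEE approaches mentioned above differ in where the noise is added and who is trusted.

```python
import math
import random

def clip(update, c):
    """Scale an update down so its L2 norm is at most c (bounds sensitivity)."""
    norm = math.sqrt(sum(u * u for u in update))
    scale = min(1.0, c / norm) if norm > 0 else 1.0
    return [u * scale for u in update]

def dp_federated_average(client_updates, clip_norm=1.0, noise_std=0.0, rng=None):
    """Average clipped client updates and add Gaussian noise.
    With noise_std == 0 this reduces to plain federated averaging."""
    rng = rng or random.Random(0)
    clipped = [clip(u, clip_norm) for u in client_updates]
    n, dim = len(clipped), len(clipped[0])
    avg = [sum(u[i] for u in clipped) / n for i in range(dim)]
    # Noise is scaled by the sensitivity of the average (clip_norm / n).
    return [a + rng.gauss(0.0, noise_std * clip_norm / n) for a in avg]
```

Note that in this sketch the server adds the noise, so the server must be trusted; the talk's focus is precisely on removing that trust assumption.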
Bio: Yang Cao is an Associate Professor in the Division of Computer Science and Information Technology at Hokkaido University. He earned his Ph.D. from the Graduate School of Informatics, Kyoto University, in 2017. His research interests lie at the intersections of databases, security, and machine learning. He has published many papers in these areas, including in top venues such as VLDB, SIGMOD, ICDE, AAAI, TKDE, and USENIX Security. Two of his papers were selected as best paper finalists at ICDE 2017 and ICME 2020. He is a recipient of the IEEE Computer Society Japan Chapter Young Author Award 2019 and the Database Society of Japan Kambayashi Young Researcher Award 2021.

1 February 2023; 10:30AM


Title: The Getting of Knowledge: Search and the Global Information Ecology

Speaker: Justin Zobel, University of Melbourne
Abstract: Search technology, originally developed in the field of information retrieval as a computational replacement for the physical indexes used for libraries, is today a key enabler of the embedding of online activity in our lives. In combination with the Web, it has led to the emergence of what might be called the information ecology. This ecology not only adapts to how it is used – collectively and by individuals – but is leading to human adaptation as well, changing our activity in unexpected ways. In this lecture, I reflect on how the field of information retrieval might be defined and understood, an information-ecology perspective on our online experience, and the ways in which search might continue to develop. These reflections illustrate how a human-centric examination of search and information retrieval can suggest socially focused research questions as well as directions for refinement of the technology.
Bio: Professor Justin Zobel is a Pro Vice-Chancellor at the University of Melbourne and a Redmond Barry Distinguished Professor, and was recently elected to the SIGIR Academy. He completed his PhD in Computer Science at Melbourne and for many years led the Search Engine group at RMIT University before returning to Melbourne in 2008. In the research community, Professor Zobel is best known for his role in the development of algorithms for efficient web search. His current research areas include search, algorithms and data structures, and measurement. He is the author of three highly regarded textbooks on graduate study and research methods.

25 April 2023; 10:30AM; DC 2585 (Please note changed room)

Title: Resilient Data Management Systems: Challenges and Opportunities 
Speaker: Jelle Hellings, McMaster University 
Abstract: The emergence of blockchain technology is fueling the development of new blockchain-based resilient data management systems (BC-RDMSs) that can manage data between fully independent parties (federated data management) and provide resilience to Byzantine failures (e.g., hardware failures, software failures, and malicious behavior). Due to these qualities, the usage of BC-RDMSs has been proposed in areas such as finance, health care, IoT, agriculture, and fraud prevention.

At their core, these BC-RDMSs are distributed fully-replicated systems in which each participant maintains a copy of a ledger that stores an append-only list of all transactions requested by users and executed by the system. This ledger is constructed and stored in a tamper-proof manner: new transactions can only be appended via consensus-based agreement steps that require support of a majority of replicas, ruling out unintended changes due to a minority of faulty replicas. As the ledger is maintained by all replicas, it is stored in a highly redundant manner and will survive even if individual replicas fail.
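The tamper-evidence of such a ledger can be sketched with a simple hash chain. This toy Python example is my illustration, not an actual BC-RDMS implementation; it omits the consensus-based agreement and replication the abstract describes, and shows only why changing an earlier entry invalidates the chain.

```python
import hashlib

class Ledger:
    """Append-only list of transactions, each entry bound to its
    predecessor by a SHA-256 hash."""

    def __init__(self):
        self.entries = []  # list of (txn, hash) pairs

    def append(self, txn):
        """Append a transaction, chaining its hash to the previous entry."""
        prev_hash = self.entries[-1][1] if self.entries else "genesis"
        h = hashlib.sha256((prev_hash + txn).encode()).hexdigest()
        self.entries.append((txn, h))

    def verify(self):
        """Recompute the chain; any tampered entry breaks verification."""
        prev_hash = "genesis"
        for txn, h in self.entries:
            if h != hashlib.sha256((prev_hash + txn).encode()).hexdigest():
                return False
            prev_hash = h
        return True

ledger = Ledger()
ledger.append("alice pays bob 5")
ledger.append("bob pays carol 2")
assert ledger.verify()
ledger.entries[0] = ("alice pays bob 500", ledger.entries[0][1])  # tamper
assert not ledger.verify()
```

In a real BC-RDMS, each replica holds such a chain, and the consensus step ensures a majority agrees on every appended entry before it becomes part of the ledger.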

On the one hand, current BC-RDMSs provide novel guarantees not provided by traditional database systems. On the other hand, the current consensus-based techniques used for BC-RDMSs are costly and put heavy restrictions on the operations of these systems, limiting their practical usage. In this talk, I will provide a tour of BC-RDMSs: we will look at how they work, at their limitations, and at how ongoing research aims to improve them.

Bio: Jelle Hellings is currently an Assistant Professor at McMaster University. His work is centered around novel directions for high-performance data management systems. Currently, his focus is on the development of scalable resilient systems that can deal with faulty behavior (e.g., hardware failures, software failures, and malicious attacks). He is also interested in graph databases, database theory, and in algorithms, data structures, and optimization techniques that enable efficient query processing. Previously, Jelle worked as a Postdoctoral Fellow at the University of California, Davis, where he focused on the theoretical aspects of resilient systems. Before that, Jelle was a PhD student in the Databases and Theoretical Computer Science research group at Hasselt University, where he worked on database theory and graph query languages. Recently, Jelle has served on the program committees of ACM SIGMOD, IEEE ICDE, ACM DEBS, IEEE ICDCS, and IEEE DAPPS, and has reviewed for the conferences ACM PODS, ACM/IEEE LICS, ICDT, and EATCS ICALP, and for several journals such as ACM TODS, VLDBJ, TCS, IEEE TDSS, and IEEE TNSM.

15 May 2023; 10:30AM

Title: Data and AI Model Markets: Grand Opportunities for Facilitating Sharing, Discovery, and Integration in Data and AI Economies video
Speaker: Jian Pei, Duke University
Abstract: Data and AI model sharing has been a long-standing bottleneck for AI and data economies. In this talk, I will argue that data and AI model discovery and integration are foundations for effective sharing. I will also revisit why sharing remains a big challenge and why many existing approaches, like data warehouses, data lakes, federated databases, and federated learning, are still far from enough to solve the problem, particularly for sharing among organizations. Then, I will advocate data and AI markets as a potential grand opportunity for data and AI model sharing at scale, particularly for inter-organization sharing. Using some recent studies, I will demonstrate some exciting technical problems in data and AI model markets for the database and data science communities. I will also offer my humble views on future directions for data and AI model markets.
Bio: Jian Pei is a Professor at Duke University. His research focuses on data science, data mining, database systems, machine learning, and information retrieval. With his expertise in developing data science principles and techniques for novel data-driven and data-intensive applications and transferring them to products and business practice, he has been recognized as a Fellow of the Royal Society of Canada, the Canadian Academy of Engineering, ACM, and IEEE. He received several prestigious awards, such as the 2017 ACM SIGKDD Innovation Award, the 2015 ACM SIGKDD Service Award, and the 2014 IEEE ICDM Research Contributions Award. He has previously served as the chair of ACM SIGKDD and as the Editor-in-Chief of IEEE TKDE.

24 May 2023; 11:00AM (Note the later start time)

Title: Data Management for Emerging Problems in Large Networks video
Speaker: Arijit Khan, Aalborg University
Abstract: Graphs are widely used in many application domains, including social networks, knowledge graphs, biological networks, software collaboration, geo-spatial road networks, and interactive gaming, among many others. One major challenge for graph querying and mining is that non-professional users are not familiar with the complex schema and information descriptions. It becomes hard for users to formulate a query (e.g., SPARQL or an exact subgraph pattern) that can be properly processed by the existing systems. As an example, Freebase, which powers Google's knowledge graph, alone has over 22 million entities and 350 million relationships in about 5,428 domains. Before users can query anything meaningful over this data, they are often overwhelmed by the daunting task of attempting to even digest and understand it. Without knowing the exact structure of the data and the semantics of the entity labels and their relationships, can we still query them and obtain the relevant results? In this talk, I shall give an overview of our user-friendly, embedding-based, scalable techniques and systems for querying big graphs, including knowledge graphs.
Bio: Arijit Khan is an IEEE senior member, an ACM distinguished speaker, and an associate professor in the Department of Computer Science, Aalborg University, Denmark. He earned his PhD from the Department of Computer Science, University of California, Santa Barbara, USA, and did a post-doc in the Systems group at ETH Zurich, Switzerland. He has been an assistant professor in the School of Computer Science and Engineering, Nanyang Technological University, Singapore. Arijit is the recipient of the prestigious IBM PhD Fellowship in 2012-13. He has published more than 70 papers in premier database and data mining conferences and journals, including ACM SIGMOD, VLDB, IEEE TKDE, IEEE ICDE, SIAM SDM, USENIX ATC, EDBT, The Web Conference (WWW), ACM WSDM, and ACM CIKM. Arijit co-presented tutorials on emerging graph queries, applications, and big graph systems at VLDB (2017, 2015, and 2014), ACM CIKM (2022), and IEEE ICDE 2012. He has served on the program committees of ACM KDD, ACM SIGMOD, VLDB, IEEE ICDE, IEEE ICDM, EDBT, and ACM CIKM, and on the senior program committee of WWW. Arijit served as co-chair of the Big-O(Q) workshop co-located with VLDB 2015, and wrote a book on uncertain graphs in Morgan & Claypool's Synthesis Lectures on Data Management. He has contributed invited chapters and articles on big graph querying and mining to the ACM SIGMOD blog, the Springer Handbook of Big Data Technologies, and the Springer Encyclopedia of Big Data Technologies. He has been invited to give tutorials and talks across 10 countries, including at the National Institute of Informatics (NII) Shonan Meeting on "Graph Database Systems: Bridging Theory, Practice, and Engineering" (2018, Japan), the Asia Pacific Web and Web-Age Information Management Joint Conference on Web and Big Data (APWeb-WAIM 2017), the International Conference on Management of Data (COMAD 2016), and the Dagstuhl Seminars on graph algorithms and systems (2014 and 2019, Schloss Dagstuhl - Leibniz Center for Informatics, Germany).
Dr. Khan serves as an associate editor of IEEE TKDE (2019-present), served as proceedings chair of EDBT 2020, and is a TKDE poster track co-chair for IEEE ICDE 2023.

29 May 2023; 10:30AM

Title: Characterizing Machine Learning I/O with MLPerf Storage video
Speaker: Oana Balmau, McGill University
Abstract: Data is the driving force behind machine learning (ML) algorithms. The way we ingest, store, and serve data can impact end-to-end training and inference performance significantly. For instance, as much as 50% of the power can go into storage and data cleaning in large production settings. The amount of data that we produce is growing exponentially, making it expensive and difficult to keep entire training datasets in main memory. Increasingly, ML algorithms will need to access data directly from persistent storage in an efficient manner. To address this challenge, this work sets out to characterize I/O patterns in ML, with a focus on data pre-processing and training.

We use trace collection to understand storage impact in ML. Key factors we are investigating include the workload type, the software framework used (e.g., PyTorch, TensorFlow), the accelerator type (e.g., GPU, TPU), the dataset-size-to-memory ratio, and the degree of parallelism. The trace collection is done mainly through eBPF and other system monitoring tools such as mpstat and NVIDIA Nsight. Our traces include VFS-layer calls such as read, write, open, and create, as well as mmap calls, block I/O accesses, CPU use, memory use, and accelerator use. Based on the trace analysis, we plan to build a synthetic I/O workload generator that accurately reproduces I/O patterns for representative ML workloads while simulating the computation time.
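As a hedged sketch of the kind of rollup such trace analysis involves, the snippet below aggregates I/O bytes per process and call type from VFS-layer records. The record format and field names are illustrative only and are not the authors' actual tooling or schema.

```python
from collections import defaultdict

def io_summary(records):
    """Sum bytes per (pid, syscall) pair from a list of trace records.
    Each record is a (pid, syscall, nbytes) tuple; this mirrors the kind
    of per-call aggregation a VFS-layer trace analysis produces."""
    totals = defaultdict(int)
    for pid, call, nbytes in records:
        totals[(pid, call)] += nbytes
    return dict(totals)

# Hypothetical trace records (pid, syscall, bytes); values are made up.
trace = [
    (1001, "read", 4096),
    (1001, "read", 4096),
    (1001, "write", 512),
    (2002, "read", 8192),
]
summary = io_summary(trace)
assert summary[(1001, "read")] == 8192
assert summary[(2002, "read")] == 8192
```

Summaries like this, broken down further by time window and layer (VFS vs. block I/O), are what a synthetic workload generator would then be calibrated against.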

Bio: Oana Balmau is an Assistant Professor in the School of Computer Science at McGill University, where she leads the DISCS Lab. Her research is focused on storage systems and data management systems, in particular for workloads in machine learning, data science, and edge computing. Oana completed her PhD in Computer Science at the University of Sydney and earned her Bachelor's and Master's degrees from EPFL, Switzerland. Oana's doctoral dissertation won the CORE John Makepeace Bennett Award 2021 for the best computer science dissertation in Australia and New Zealand and an Honorable Mention for the ACM SIGOPS Dennis M. Ritchie Doctoral Dissertation Award. Finally, Oana is a part of MLCommons, where she leads the effort on storage benchmarking for machine learning.

12 June 2023; 10:30AM

Title: TBD
Speaker: TBD
Abstract: TBD
Bio: TBD