Data Systems Seminar Series (2022-2023)

The Data Systems Seminar Series provides a forum for presentation and discussion of interesting and current data systems issues. It complements our internal data systems meetings by bringing in external colleagues. The talks that are scheduled for this year are listed below.

The talks are usually held on a Monday at 10:30 am in room DC 1302. Exceptions are flagged. Due to Covid-19, some talks might be held virtually over Zoom; these will be identified.

The talks are open to the public.

We will post the presentation videos whenever possible. Past DSG Seminar videos can be found on the DSG YouTube channel.


The Data Systems Seminar Series is supported by


Hannah Bast
Vasiliki Kalavri
Essam Mansour
Çağatay Demiralp
Jyoti Leeka
Yang Cao

19 September 2022; 10:30AM (The seminar will be online; please use this link to join)

Title: The QLever SPARQL engine (and: how (not) to get practical work published) video
Speaker: Hannah Bast, Albert-Ludwigs-Universität Freiburg
Abstract: QLever is a new SPARQL engine, which can search very large knowledge graphs (100 billion triples and more) efficiently with very moderate resources (a standard PC is enough). QLever features live autocompletion, a text search component, and support for difficult geographic queries and the interactive visualization of their results. Building such an engine from the ground up is a lot of work, but also very rewarding. I will give you a guided tour with many demos and various glimpses under the hood, with examples of clever algorithms, algorithm engineering, and modern C++. I will also talk about the meta topics of reproducibility and how (not) to get work of this kind published.
Bio: Hannah Bast started programming early in her life and loved mathematics, so she studied mathematics and computer science and stumbled into a career in theoretical computer science. Reviewers liked her theoretical work, but she wasn't too happy with doing art only for art's sake. She then regressed to more practical work again, which she likes a lot, but reviewers not so much. She has been a full professor at the University of Freiburg since 2009. She has served as a dean and as a member of the AI commission of the German parliament, and worked at Google, creating a new route planner for Google Maps. She enjoys life despite its absurdity.

17 October 2022; 10:30AM; DC 1302

Title: Efficient collaborative analytics with no information leakage: An idea whose time has come video
Speaker: Vasiliki Kalavri, Boston University
Abstract: Enabling secure outsourced analytics with practical performance has been a long-standing research challenge in the database community. In this talk, I will present our work towards realizing this vision with Secrecy, a new framework for secure relational analytics in untrusted clouds. Secrecy targets offline collaborative analytics, where data owners (hospitals, companies, research institutions, or individuals) are willing to allow certain computations on their collective private data, provided that data remain siloed from untrusted entities. To ensure no information leakage and provable security guarantees, Secrecy relies on cryptographically secure Multi-Party Computation (MPC). Instead of treating MPC as a black box, like prior works, Secrecy exposes the costs of oblivious queries to the planner and employs novel logical, physical, and protocol-specific optimizations, all of which are applicable even when data owners do not participate in the computation. As a result, Secrecy outperforms state-of-the-art systems and can comfortably process much larger datasets with good performance and modest use of resources.
Bio: Vasiliki (Vasia) Kalavri is an Assistant Professor of Computer Science at Boston University, where she leads the Complex Analytics and Scalable Processing (CASP) Systems lab. Vasia and her team enjoy doing research on multiple aspects of (distributed) data-centric systems. Recently, they have been working on self-managed systems for data stream processing, systems for scalable graph ML, and MPC systems for private collaborative analytics. Before joining BU, Vasia was a postdoctoral fellow at ETH Zurich and received a joint PhD from KTH (Sweden) and UCLouvain (Belgium).

21 November 2022; 10:30AM 

Title: Data Science powered by knowledge graphs video

Speaker: Essam Mansour, Concordia University
Abstract: In recent years, we have witnessed a growing interest in data science not only from academia but particularly from companies investing in data science platforms to analyze large amounts of data. In this process, a myriad of data science artifacts, such as datasets and pipeline scripts, are created. Yet, there has so far been no systematic attempt to holistically exploit the collected knowledge and experiences that are implicitly contained in the specification of these pipelines, e.g., compatible datasets, cleansing steps, ML algorithms, parameters, etc.

Instead, data scientists still spend a considerable amount of their time trying to recover relevant information and experiences from colleagues, trial and error, lengthy exploration, etc. In this paper, we, therefore, propose a novel system (KGLiDS) that employs machine learning to extract the semantics of data science pipelines and captures them in a knowledge graph, which can then be exploited to assist data scientists in various ways. This abstraction is the key to enabling Linked Data Science since it allows us to share the essence of pipelines between platforms, companies, and institutions without revealing critical internal information and instead focusing on the semantics of what is being processed and how. Our comprehensive evaluation uses thousands of datasets and more than thirteen thousand pipeline scripts extracted from data discovery benchmarks and the Kaggle portal and shows that KGLiDS significantly outperforms state-of-the-art systems on related tasks, such as dataset recommendation and pipeline classification.

Bio: Dr. Essam Mansour has been an assistant professor since 2019 in the Department of Computer Science and Software Engineering (CSSE) at Concordia University in Montreal, and the head of the Cognitive Data Science lab (CoDS). His research program focuses on developing Cognitive Data Science Platforms for federated and big datasets. His research interests are in the broad areas of parallel/distributed systems, data management, knowledge graphs, and graph neural networks. Essam has spent more than 10 years doing world-class research in the areas of databases, parallel/distributed systems, big data analytics, and querying geo-distributed graphs. He is developing and optimizing big data systems to work at scale on supercomputers and cloud resources. During these years, his research contributions have led to more than 30 conference and journal papers (mostly in top-tier venues, such as VLDBJ, PVLDB, SIGMOD, ICDE, EDBT, and CIKM). He has been invited as a reviewer for top journals, such as ACM Transactions on Database Systems (TODS), VLDB Journal, and IEEE Transactions on Knowledge and Data Engineering (TKDE). Essam has also served as a program committee member for several top conferences, such as VLDB 2016 to 2023, SIGMOD 2023, and ICDE 2016.

5 December 2022; 10:30AM

Title: TBD video

Speaker: Çağatay Demiralp, Sigma Computing
Abstract: TBD
Bio: TBD

10 January 2023; 10:30AM

Title: Query Optimizer as a Service video

Speaker: Jyoti Leeka, Microsoft Gray Systems Lab.
Abstract: Query optimization is a critical technology needed by all modern data processing systems. However, it is traditionally implemented in silos and is deeply embedded in different systems. Furthermore, over the years, query optimizers have become less understood and rarely touched pieces of code that are brittle to changes and very expensive to maintain, thus slowing down the pace of innovation. In this talk, I will argue that it is time to design query optimizer as a service in modern cloud architectures. Such a design will help build a set of well-maintained optimizations that are externalized from the query engines and that could be learned (and improved) using the large workloads present in modern clouds. I will present a reference architecture for our query optimizer as a service, explaining details of intra-query and inter-query optimizations performed. A key enabler for the externalization of intra-query optimization is the plethora of recent machine learning-based techniques developed to improve query optimizer components, such as cardinality, cost model, and query planner. On the other hand, externalization of inter-query optimization, also known as multi-query optimization, is motivated by numerous efforts on view materialization, physical layouts (i.e., partitioning, etc.), and most recently by Pipemizer, a data pipeline-aware optimization effort at Microsoft.
Bio: Jyoti Leeka is a Senior Scientist currently focusing on improving the performance of Microsoft’s large-scale data-intensive production analytics clusters. These clusters comprise 300k servers running hundreds of thousands of production analytic jobs on a daily basis, written by thousands of developers, processing several exabytes of data per day, and involving several hundred petabytes of I/O. The main focus of this work is to develop algorithms to find optimal/approximate physical designs for Microsoft’s production job pipelines. Before joining GSL, Jyoti was a postdoctoral researcher at MSR for two years. Her focus was on query optimization for distributed systems.

16 January 2023; 10:00AM (The seminar will be online; please use this link to join; note earlier time)

Title: Towards Differentially Private Federated Learning with Untrusted Servers video

Speaker: Yang Cao, Kyoto University
Abstract: Federated learning has received increasing attention in academia and industry as a new privacy-preserving machine learning paradigm. Unlike traditional machine learning, which requires data collection before training, in federated learning, the clients collaboratively train a model under the coordination of a central server. In particular, the clients only share model updates with the server, and all raw data are stored locally. However, recent studies showed that the model updates might reveal sensitive information to the server. In addition, federated learning itself does not guarantee formal privacy. This talk will review recent advances in differentially private federated learning under untrusted servers, introduce our attempts towards this goal by leveraging LDP, the shuffle model of DP, and TEE, and discuss some open problems.
Bio: Yang Cao is an Associate Professor in the Division of Computer Science and Information Technology at Hokkaido University. He earned his Ph.D. from the Graduate School of Informatics, Kyoto University, in 2017. His research interests lie in the intersections between databases, security, and machine learning. He has published many papers in these areas, including in top venues such as VLDB, SIGMOD, ICDE, AAAI, TKDE, and USENIX Security. Two of his papers were selected as best paper finalists at ICDE 2017 and ICME 2020. He is a recipient of the IEEE Computer Society Japan Chapter Young Author Award 2019 and the Database Society of Japan Kambayashi Young Researcher Award 2021.

17 April 2023; 10:30AM

Title: TBD video
Speaker: TBD
Abstract: TBD
Bio: TBD

15 May 2023; 10:30AM

Title: TBD video 
Speaker: TBD
Abstract: TBD
Bio: TBD

12 June 2023; 10:30AM

Title: TBD video
Speaker: TBD
Abstract: TBD
Bio: TBD