Data Systems Seminar Series (2021-2022) | Data Systems Group

The Data Systems Seminar Series provides a forum for presentation and discussion of interesting and current data systems issues. It complements our internal data systems meetings by bringing in external colleagues. The talks that are scheduled for this year are listed below.

The talks are usually held on a Monday at 10:30 am in room DC 1302. Exceptions are flagged. Due to COVID-19, until further notice all talks will be virtual over Zoom.

The talks are open to the public. Please register here.

We will try to post the presentation notes whenever possible. Please click on the presentation title to access these notes.

The Data Systems Seminar Series is supported by

20 September 2021; 10:30AM

Title:	Cross-Domain Text-to-SQL Semantic Parsing
Speaker:	Yanshuai Cao, Borealis.AI
Abstract:	Large-scale pre-training has enabled many NLP applications via transfer learning. However, many studies have shown that current deep learning models often rely on superficial cues and dataset biases to achieve seemingly high performance on a given dataset without proper understanding. This talk will discuss the challenges of cross-domain text-to-SQL semantic parsing and how it can be a test-bed for learning to reason in the real world. I will review recent advances in this field, including some of our work tackling the scarce data aspect of this problem. In particular, I will discuss how models encode prior knowledge about this problem's structures; how to train deep transformers on small datasets; and how to perform data augmentation when minor changes could alter the semantics. I will also showcase Turing, the natural language database interface demo built from our cross-domain text-to-SQL semantic parser.
Bio:	Yanshuai Cao is a Senior Research Lead at Borealis AI, conducting R&D and building products for RBC. His research spans natural language processing, generative models, and adversarial machine learning. Yanshuai received his Ph.D. from the University of Toronto under the supervision of David J. Fleet and Aaron Hertzmann.

18 October 2021; 10:30AM

Title:	Recent Advances in Transactional Concurrency Control
Speaker:	Goetz Graefe, Google, Inc.
Abstract:	False conflicts have given locking and serializability a reputation for poor concurrency, poor scalability, and poor system performance. Causes include unnecessarily coarse lock scopes, excessive lock durations, and simplistic lock modes. This talk surveys three published techniques that aim to address these false conflicts.
Bio:	Goetz Graefe used to be a professor in Portland, OR and Boulder, CO. He served as a software architect in Microsoft's SQL Server product and as an HP Fellow in Hewlett Packard Enterprise. He has been with Google for the last five years. He wrote the Cascades query optimization framework and was awarded the 2017 ACM SIGMOD Edgar F. Codd Innovations Award. He is interested in database query optimization, query execution, indexing, stream indexing, transactions, concurrency control, logging, recovery, and availability.

15 November 2021; 10:30AM

Title:	Adaptive Join Order Optimization using Search Space Linearization
Speaker:	Thomas Neumann, Technische Universität München
Abstract:	Join ordering is one of the core problems of query optimization, as differences in join order can affect the execution time of queries by orders of magnitude. Unfortunately, the problem is NP-hard in general, and real-world queries can join hundreds of relations, which makes exact solutions prohibitively expensive. In this talk we show how to tackle the join ordering problem by using a search space linearization technique. This adaptive optimization mechanism allows for a smooth transition from guaranteed optimality to a greedier approach, depending on the size of the problem. In practice, a surprisingly large number of queries can be solved optimally or near optimally, with very low optimization times even for hundreds of relations.
Bio:	Thomas Neumann is a full professor in the Department of Computer Science at the Technical University of Munich. His research interests are in the areas of database systems, query processing, and query optimization. In 2020, he received the Gottfried Wilhelm Leibniz Prize, which is considered the most important research award in Germany.

13 December 2021; 10:30AM

Title:	Basil: Scaling BFT with ACID transactions
Speaker:	Natacha Crooks, University of California, Berkeley
Abstract:	In this talk, I will discuss alternative ways to scale the abstraction of a Byzantine fault-tolerant shared log, which has regained popularity with the recent growth of decentralized trust. Specifically, I will present Basil. Basil leverages ACID transactions to scalably implement this abstraction. Unlike traditional BFT approaches, Basil executes non-conflicting operations in parallel and commits transactions in a single round-trip during fault-free executions. Basil improves throughput over traditional BFT systems by four to five times, and is only four times slower than TAPIR, a non-Byzantine replicated system. Basil's novel recovery mechanism further minimizes the impact of failures: with 30% Byzantine clients, throughput drops by less than 25% in the worst case.
Bio:	Natacha Crooks is an Assistant Professor at UC Berkeley. She works at the intersection of distributed systems and databases with a recent focus on privacy and integrity in transactional datastores. She obtained her PhD from UT Austin in 2019 for which she obtained the Dennis Ritchie Doctoral Dissertation Award.

10 January 2022; 10:30AM

Title:	The Evolving Face of Misinformation in Text, Image, and Video Content
Speaker:	David Doermann, University at Buffalo
Abstract:	The computer vision community has created a technology that, unfortunately, is getting more bad press than good. In 2014, the first GAN paper demonstrated the automatic generation of very low-resolution faces of people who never existed from a random latent distribution. Although the technology was impressive because it was automated, it was nowhere near as good as what could be done with a simple photo editor. In the same year, DARPA started the media forensics program to combat the proliferation of edited images and videos generated by our adversaries. Although DARPA envisioned the development of automated technologies, no one thought they would evolve so fast. Five years later, the technology has progressed to the point where even a novice can modify full videos, i.e., DeepFakes, and, overnight on commodity hardware, generate new content of people and scenes that never existed. Recently the US government has become increasingly concerned about the real dangers of using “DeepFakes” technologies from both national security and misinformation points of view. To this end, academia, industry, and the government need to come together to apply technologies, develop policies that put pressure on service providers, and educate the public before we get to the point where “seeing is believing” is a thing of the past. In this talk, I will cover some of the primary efforts in applying counter manipulation detection technology. I will discuss how we are extending existing technology to deal with the problems of detecting GAN-generated content and highlighting inconsistencies between the text, audio, image, and video content in heterogeneous media “assets.”
Bio:	Dr. David Doermann is a Professor of Empire Innovation and the Director of the Artificial Intelligence and Data Science Institute at the University at Buffalo (UB). Prior to coming to UB, he was a Program Manager with the Information Innovation Office at the Defense Advanced Research Projects Agency (DARPA) where he developed, selected, and oversaw research and transition funding in the areas of computer vision, human language technologies, and voice analytics. From 1993 to 2018, David was a member of the research faculty at the University of Maryland, College Park. In his role in the Institute for Advanced Computer Studies, he served as Director of the Laboratory for Language and Media Processing and as an adjunct member of the graduate faculty for the Department of Computer Science and the Department of Electrical and Computer Engineering. He and his group of researchers focus on many innovative topics related to the analysis and processing of document images and video, including triage, visual indexing and retrieval, enhancement, and recognition of both textual and structural components of visual media. David has over 250 publications in conferences and journals, is a fellow of the IEEE and IAPR, has numerous awards, including an honorary doctorate from the University of Oulu, Finland, and is a founding Editor-in-Chief of the International Journal on Document Analysis and Recognition.

30 May 2022; 10:30AM

Title:	AI Marketplaces
Speaker:	Kaladhar Voruganti, Equinix
Abstract:	Throughout the history of AI, we have seen major transformational changes that have made AI algorithms more accurate and accessible to the masses. In first-generation AI systems, human experts manually entered rules (e.g. via LISP, Prolog languages) to control systems, but these systems were mostly brittle, and couldn’t solve real-life problems. Then with the advent of big data and big compute (e.g. re-purposing of GPUs for deep learning), we entered the realm of second-generation AI systems, where it became possible to improve the accuracy of AI systems to match human accuracy for many important everyday tasks like vision, speech recognition/translation, anomaly detection and trends prediction. Thus, AI has become mainstream. However, now we are entering the era of third-generation AI systems where in order to take the accuracy of AI models to the next level, it is necessary for multiple organizations to share data and trained AI models with each other. Thus, in addition to algorithm accuracy, data/model governance, provenance, and trust are paramount as organizations start to share data and algorithms with each other. However, many organizations are hesitant to share their data externally due to privacy and control concerns (use of data for unauthorized reasons). Similarly, organizations are hesitant to use external data due to lack of proper provenance information that could lead to biases and potential security vulnerabilities in the imported data. In this talk we present the concept of “AI Marketplaces” and how they help organizations to share both data and algorithms with each other, and thus, help to take AI solutions across organizational boundaries. We will present the fundamentals of AI marketplace architectures, different types of trust/governance models, security approaches, and how federated AI architectures help to move “Compute to Data” instead of following the traditional “Data to Compute” computer architecture model. We will also share our experiences in how AI marketplaces are being used in various real-life use cases.
Bio:	Kaladhar Voruganti is a Senior Fellow, Technology and Architecture, in the Office of the CTO at Equinix. Equinix is the largest retail data center company in the world; like an airport, it is where the large public clouds, networks, financial companies, media companies and enterprises come to interconnect with each other. He is currently working on Distributed AI and AI Marketplace architectures. He previously worked at IBM Research and the NetApp CTO office on large scale autonomous systems. He obtained his BSc in Computer Engineering and PhD in Computing Science from the University of Alberta, Canada. He has more than 70 patents filed/issued.

23 June 2022; 11:30PM - DC 1302

Title:	A Vision for Data Alignment and Integration in Data Lakes
Speaker:	Renée Miller, Northeastern University
Abstract:	The requirements for integration over massive, heterogeneous table repositories (aka data lakes) are fundamentally different than they are for federated data integration (where the data owned by an enterprise is integrated into a cohesive whole) or data exchange (where data is exchanged and shared among a small set of autonomous peers). In this talk, I will outline a vision for data alignment and integration in data lakes. Data lakes afford new opportunities for using new methods, from network science and other areas, to discover emergent semantics from large heterogeneous collections of data sets. I will illustrate these ideas by discussing the problem of data lake disambiguation, work which received the best paper award at EDBT 2021.
Bio:	Renée J. Miller is a University Distinguished Professor of Computer Science at Northeastern University. She is a Fellow of the Royal Society of Canada and received the US Presidential Early Career Award for Scientists and Engineers (PECASE), the highest honor bestowed by the United States government on outstanding scientists and engineers beginning their careers. She received an NSF CAREER Award, the Ontario Premier’s Research Excellence Award, and an IBM Faculty Award. She formerly held the Bell Canada Chair of Information Systems at the University of Toronto and is a fellow of the ACM. Her work has focused on the long-standing open problem of data integration and has achieved the goal of building practical data integration systems. She and her colleagues received the ICDT Test-of-Time Award and the 2020 Alonzo Church Award for Outstanding Contributions to Logic and Computation for their influential work establishing the foundations of data exchange. In 2020, she received the CS-Can/Info-Can Lifetime Achievement Award in Computer Science. Professor Miller is an Editor-in-Chief of the VLDB Journal and former president of the Very Large Data Base (VLDB) Foundation. She received her PhD in Computer Science from the University of Wisconsin, Madison and bachelor’s degrees in Mathematics and Cognitive Science from MIT.