Data Systems Seminar Series (2019-2020)

The Data Systems Seminar Series provides a forum for presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks that are scheduled for this year are listed below.

The talks are usually held on a Monday at 10:30 am in room DC 1302. Exceptions are flagged.

We will try to post the presentation notes whenever possible. Please click on the presentation title to access these notes.


The Database Seminar Series is supported by


Guoliang Li
Marco Serafini
Saif M. Mohammad
Umar Farooq Minhas
Juan Sequeda
George Fletcher
Molham Aref

10 September 2019, 10:30 am, DC 1302

Title: AI-Native Database (notes, video)
Speaker: Guoliang Li, Tsinghua University
Abstract:

In the big data era, database systems face three challenges. First, traditional heuristics-based optimization techniques (e.g., cost estimation, join order selection, knob tuning) cannot meet the high-performance requirements of large-scale data, varied applications, and diverse data. We can design learning-based techniques to make databases more intelligent. Second, many database applications need AI algorithms, e.g., image search in a database. We can embed AI algorithms into the database, use database techniques to accelerate AI algorithms, and provide AI capabilities inside databases. Third, traditional databases focus on general-purpose hardware (e.g., CPUs) but cannot fully utilize new hardware (e.g., ARM, AI chips). Moreover, besides the relational model, we can use the tensor model to accelerate AI operations. Thus, we need to design new techniques to make full use of new hardware.

To address these challenges, we design an AI-native database. On the one hand, we integrate AI techniques into databases to provide self-configuring, self-optimizing, self-healing, self-protecting and self-inspecting capabilities. On the other hand, we enable databases to provide AI capabilities through declarative languages, lowering the barrier to using AI.

In this talk, I will introduce the five levels of AI-native databases and present the open challenges in designing an AI-native database. I will also use automatic database knob tuning, a deep-reinforcement-learning-based optimizer, machine-learning-based cardinality estimation, and an automatic index/view advisor as examples to showcase the superiority of AI-native databases.

 

Bio: Guoliang Li is a tenured full Professor in the Department of Computer Science, Tsinghua University, Beijing, China. His research interests include AI-native databases, big data analytics and mining, crowdsourced data management, big spatio-temporal data analytics, and large-scale data cleaning and integration. He has published more than 100 papers in premier conferences and journals, such as SIGMOD, VLDB, ICDE, SIGKDD, SIGIR, TODS, VLDB Journal, and TKDE. He is a PC co-chair of DASFAA 2019, WAIM 2014, WebDB 2014, and NDBC 2016. He serves as an associate editor for IEEE Transactions on Knowledge and Data Engineering, VLDB Journal, ACM Transactions on Data Science, and IEEE Data Engineering Bulletin. He has regularly served as a (senior) PC member of many premier conferences, such as SIGMOD, VLDB, KDD, ICDE, WWW, IJCAI, and AAAI. His papers have been cited more than 6000 times. He has received several best paper awards at top conferences, including the CIKM 2017 best paper award, ICDE 2018 best paper candidate, KDD 2018 best paper candidate, DASFAA 2014 best paper runner-up, and APWeb 2014 best paper award. He received the VLDB Early Research Contribution Award 2017, the IEEE TCDE Early Career Award 2014, the National Youth Talent Support Program 2017, ChangJiang Young Scholar 2016, the NSFC Excellent Young Scholars Award 2014, and CCF Young Scientist 2014.

25 October 2019, 2:00 pm, DC 1302 (Please note the unusual day and time)

Title: Connected data: pushing the envelope of data analytics systems (notes, video)
Speaker: Marco Serafini, University of Massachusetts Amherst
Abstract: Many advanced data science applications, from social networks to knowledge bases and data integration, analyze complex, high-dimensional, connected data, which is often modeled as a graph. Rather than flattening out connections into a tabular form, these applications treat them as first-class citizens. Applications that mine and navigate connected data push the envelope of traditional data analytics systems, both relational and graph-native, in similar ways. This talk will argue that better system support for connected data ultimately benefits both graph and relational analytics. It will discuss some of these dimensions at different levels of the system stack: from storage systems to large-scale cloud execution platforms, from data analysis algorithms that efficiently deal with large intermediate results to new high-level APIs for emerging applications.
Bio:

Marco Serafini is an Assistant Professor in the College of Information and Computer Sciences at the University of Massachusetts Amherst. His research interests are in data management systems and distributed systems. His work has impacted popular open-source systems such as Apache ZooKeeper and Apache Storm. He designed Arabesque, a system for distributed graph mining. Before joining UMass, he worked at the Qatar Computing Research Institute and Yahoo! Research.

28 October 2019, 10:30 am, DC 1302 

Title: The Search for Emotions, Creativity, and Fairness in Language (notes, video)
Speaker: Saif M. Mohammad, National Research Council of Canada
Abstract: Emotions are central to human experience, creativity, and behavior. They are crucial for organizing meaning and reasoning about the world we live in. They are ubiquitous and everyday, yet complex and nuanced. In this talk, I will describe our work on the search for emotions in language -- by humans (through data annotation projects) and by machines (in automatic emotion detection systems). I will outline ways in which emotions can be represented, challenges in obtaining reliable annotations, and approaches that lead to high-quality annotations. The lexicons thus created have entries for tens of thousands of terms. They provide fine-grained scores for basic emotions as well as for valence, arousal, and dominance (argued by some to be the core dimensions of meaning). They have wide-ranging applications in natural language processing, psychology, social sciences, digital humanities, and computational creativity. I will highlight some of the applications we have explored in literary analysis and automatic text-based music generation. I will also discuss new sentiment analysis tasks such as inferring fine-grained emotion intensity and stance from tweets, as well as detecting emotions evoked by art. I will conclude with work on quantifying biases in the way language is used and the impact of such biases on automatic emotion detection systems. From social media to home assistants, from privacy concerns to neuro-cognitive persuasion, never has natural language processing been more influential, more fraught with controversy, and more entrenched in everyday life. Thus as a community, we are uniquely positioned to make substantial impact by building applications that are not only compelling and creative but also facilitators of social equity and fairness.
Bio: Dr. Saif M. Mohammad is a Senior Research Scientist at the National Research Council Canada (NRC). He received his Ph.D. in Computer Science from the University of Toronto. Before joining NRC, Saif was a Research Associate at the Institute for Advanced Computer Studies at the University of Maryland, College Park. His research interests are in Emotion and Sentiment Analysis, Computational Creativity, Psycholinguistics, Fairness in Language, and Information Visualization. Saif has served as General Chair for the Canada--UK Symposium on Ethics in AI, co-chair of SemEval (the largest platform for semantic evaluations), and co-organizer of WASSA (a sentiment analysis workshop). He has also served as an area chair for ACL, NAACL, and EMNLP in Sentiment Analysis and Fairness and Bias in NLP. His work on emotions has garnered media attention, with articles in Time, Washington Post, Slashdot, LiveScience, The Physics arXiv Blog, PC World, Popular Science, etc.

18 November 2019, 10:30 am, DC 1304 (Please note room change)

Title: ALEX: An Adaptive Learned Index for Dynamic Workloads (notes, video)

Speaker: Umar Farooq Minhas, Microsoft Research
Abstract: Machine learning is transforming database systems research. For example, recent work on “learned indexes” has changed the way we look at the decades-old field of DBMS indexing. The key idea is that indexes can be thought of as “models” that predict the position of a key in a dataset. Indexes can, thus, be learned. The original work by Kraska et al. shows that a learned index beats a B+Tree by a factor of up to three in search time and by an order of magnitude in memory footprint; however, it is limited to static, read-only workloads.

In this talk, I will present a new learned index called ALEX which addresses practical issues that arise when implementing dynamic, updatable learned indexes. ALEX effectively combines the core insights from learned indexes with proven techniques used in B+Trees to achieve high performance and low memory footprint. I will present the design and implementation of ALEX along with detailed experiments that show that ALEX not only beats the B+Tree on all workloads but also beats the original Learned Index on read-only workloads. We believe ALEX represents a key step towards making learned indexes practical for a broader class of database workloads with dynamic updates.
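As a rough illustration of the "index as a model" idea described in the abstract, the following minimal Python sketch fits a single least-squares line that predicts a key's position in a sorted array and then corrects the prediction with a bounded binary search. It is a toy, static example and is not ALEX or the index of Kraska et al.; the class name `LinearLearnedIndex` and its methods are hypothetical, and a non-empty list of numeric keys is assumed.

```python
import bisect


class LinearLearnedIndex:
    """Toy static learned index (hypothetical): a line maps key -> position."""

    def __init__(self, keys):
        # Assumption: keys is a non-empty collection of numbers.
        self.keys = sorted(keys)
        n = len(self.keys)
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(self.keys))
        # Least-squares fit of position as a linear function of the key.
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Largest prediction error over the stored keys; it bounds the search window.
        self.max_err = max(
            abs(i - self._predict(k)) for i, k in enumerate(self.keys)
        )

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return the position of key in the sorted array, or None if absent."""
        n = len(self.keys)
        guess = self._predict(key)
        # Clamp the error-bounded window to valid array positions.
        lo = min(n, max(0, guess - self.max_err))
        hi = max(lo, min(n, guess + self.max_err + 1))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < n and self.keys[i] == key:
            return i
        return None


index = LinearLearnedIndex([3, 8, 15, 21, 40, 41, 77, 100])
print(index.lookup(21), index.lookup(50))
```

Because the model is fitted once over a fixed key set, this sketch shares the static, read-only limitation mentioned in the abstract; supporting inserts is exactly the part it leaves out.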
Bio: Umar Farooq Minhas is currently a Principal Researcher in the Database Group at Microsoft Research and specializes in the systems aspects of database management and big data analytics platforms. His current research interests include exploiting machine learning to improve database systems, cloud-based database systems, novel distributed programming frameworks, next-gen virtualization (Docker & Kubernetes), and performance benchmarking. Umar also works closely with product teams in the Azure Data Org, which is responsible for all data management offerings from Microsoft.

Before joining Microsoft Research, Umar worked as a Research Staff Member at the IBM Almaden Research Center where he was co-leading various efforts around big data storage, scheduling, resource provisioning, next generation platforms, and IBM Watson services. His research ideas have been commercialized in IBM Big SQL, a SQL-on-Hadoop platform, and in IBM General Parallel File System (GPFS), a highly scalable, distributed file system.

Umar received a PhD and a Master of Mathematics in Computer Science from the David R. Cheriton School of Computer Science at the University of Waterloo and a Bachelor of Science in Computer Science from the National University of Computer and Emerging Sciences (Islamabad, Pakistan).

13 January 2020, 10:30 am, DC 1302

Title: The Socio-Technical Phenomena of Data Integration (notes, video)

Speaker: Juan Sequeda, data.world
Abstract:

Data Integration has been an active area of computer science research for over two decades. A modern manifestation is Knowledge Graphs, which integrate not just data but also knowledge at scale. Tasks such as domain modeling and schema/ontology matching are fundamental in the data integration process. Research has focused on studying the data integration phenomena from a technical point of view (algorithms and systems) with the ultimate goal of automating this task.

In the process of applying scientific results to real-world enterprise data integration scenarios to design and build Knowledge Graphs, we have experienced numerous obstacles. In this talk, I will share insights about these obstacles. I will argue that we need to think outside of a technical box and further study the phenomena of data integration with a human-centric lens: from a socio-technical point of view.

Bio: Juan F. Sequeda is the Principal Scientist at data.world. He joined through the acquisition of Capsenta, a company he founded as a spin-off from his research. He holds a PhD in Computer Science from The University of Texas at Austin.

Juan is a recipient of the NSF Graduate Research Fellowship. He received 2nd Place in the 2013 Semantic Web Challenge for his work on ConstituteProject.org, Best Student Research Paper at the 2014 International Semantic Web Conference, and the 2015 Best Transfer and Innovation Project awarded by the Institute for Applied Informatics. Juan is on the Editorial Board of the Journal of Web Semantics and is a member of multiple program committees (ISWC, ESWC, WWW, AAAI, IJCAI). He was the General Chair of AMW2018, PC chair of the ISWC 2017 In-Use track, and co-creator of the COLD workshop (co-located with ISWC for 7 years). He has served as a bridge between academia and industry as the current chair of the Property Graph Schema Working Group, a member of the Graph Query Languages task force of the Linked Data Benchmark Council (LDBC), and a past invited expert member and standards editor at the World Wide Web Consortium (W3C).

Wearing his scientific hat, Juan's goal is to reliably create knowledge from inscrutable data. His research interests are at the intersection of Logic and Data for (ontology-based) data integration and semantic/graph data management, and what is now called Knowledge Graphs.

Wearing his business hat, Juan is a product manager who does business development, strategy, and technical sales, and works with customers to understand their problems and translate them back to R&D.

25 May 2020, 10:00 am, online (Please note the unusual time; the talk will be online)

Title: What we talk about when we talk about graphs (notes, video)
Speaker: George Fletcher, Eindhoven University of Technology
Abstract:

An old idea from the humanistic sciences has it that the language we use not only restricts the manner in which we view the world, but also, in a very real sense, shapes the world around us. This view has deep roots across fields as diverse as anthropology, linguistics, and philosophy. We have been exploring the interesting ways in which this idea manifests itself in data management. In particular, we have been studying the expressive power of graph query languages at the instance level, where the focus is on characterizing the ability of languages to restrict and shape concrete graph instances, purely in terms of the structure of the instances.

In this talk, I will begin with a brief recap of such "structural" characterizations of the expressivity of query languages. I will then introduce the framework we have been developing for reasoning over graph-structured data. Following this, I will discuss how we put the framework to work, with the design of structural indexes for property graphs (a current industry standard model for graph data). I will also give an overview of our results on effectively computing the characterizations on which these index data structures are built.

I will conclude with an overview of the AvantGraph graph analytics system we are developing in my team, in which we are realizing our structural indexing techniques in practice.
Bio: George Fletcher (PhD, Indiana University Bloomington, 2007) is an associate professor of computer science and chair of the Database Research Group at Eindhoven University of Technology. His research interests span query language design and engineering, foundations of databases, and data integration. His current focus is on management of complex graphs such as social and biological networks. He is co-author of the book "Querying Graphs" (Morgan and Claypool, 2018) on contemporary graph data management and is currently participating in the graph query language international standardization efforts of the LDBC.

10 July 2020, 1:00 pm, online (Please note the unusual time; the talk will be online)

Title: Relational Artificial Intelligence (notes, video)
Speaker: Molham Aref, RelationalAI
Abstract: In this talk, I will make the case for a first-principles approach to machine learning over relational databases that exploits recent developments in database systems and theory. The input to learning classification and regression models is defined by feature extraction queries over relational databases. The mainstream approach to learning over relational data is to materialize the results of the feature extraction query, export them out of the database, and then learn over them using statistical software packages. These three steps are expensive and unnecessary. Instead, one can cast the machine learning problem as a database problem, keeping the feature extraction query unmaterialized and using a new generation of meta-algorithms to push the learning through the query. The performance of this approach benefits tremendously from structural properties of the relational data and of the feature extraction query; such properties may be algebraic (semi-ring), combinatorial (hypertree width), or statistical (sampling). Performance is further improved by leveraging recent advances in compiler technology that eliminate the cost of abstraction and allow us to specialize the computation for specific workloads and datasets. This translates to several orders-of-magnitude speed-up over state-of-the-art systems.

This work is done by my colleagues at RelationalAI and by members of our faculty research network, including Dan Olteanu (Oxford), Maximilian Schleich (Oxford), Ben Moseley (CMU), and XuanLong Nguyen (Michigan).
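To make the idea of keeping the feature extraction query unmaterialized more concrete, here is a minimal Python sketch. It assumes two hypothetical relations, R(key, x) and S(key, y), joined on `key`, and computes SUM(x * y) over the join, which is one of the aggregates a linear regression model needs. The first function materializes the join; the second relies on distributivity (the semi-ring property mentioned in the abstract) to combine per-key partial sums instead. The function names and data are illustrative assumptions and are not taken from the RelationalAI system.

```python
from collections import defaultdict


def materialized_sum_xy(R, S):
    """Baseline: build the full join result, then aggregate over it."""
    joined = [(x, y) for (k1, x) in R for (k2, y) in S if k1 == k2]
    return sum(x * y for x, y in joined)


def factorized_sum_xy(R, S):
    """Push the aggregate past the join: one partial sum per key per relation."""
    sum_x = defaultdict(float)
    sum_y = defaultdict(float)
    for k, x in R:
        sum_x[k] += x
    for k, y in S:
        sum_y[k] += y
    # For each shared key, sum over pairs of x * y equals (sum of x) * (sum of y).
    return sum(sum_x[k] * sum_y[k] for k in sum_x if k in sum_y)


# Hypothetical example data: (key, feature value) pairs.
R = [(1, 2.0), (1, 3.0), (2, 5.0), (3, 7.0)]
S = [(1, 10.0), (2, 1.0), (2, 4.0), (4, 9.0)]
print(materialized_sum_xy(R, S), factorized_sum_xy(R, S))
```

The factorized version makes one pass over each relation and never builds the join result, whose size can grow with the product of the per-key group sizes.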
Bio: Molham Aref is the Chief Executive Officer of RelationalAI. He has more than 28 years of experience in leading organisations that develop and implement high-value machine learning and artificial intelligence solutions across various industries. Prior to RelationalAI, he was CEO of LogicBlox and Predictix (now Infor) and of Optimi (now Ericsson), and co-founder of Brickstream (now FLIR). Molham held senior leadership positions at HNC Software (now FICO) and Retek (now Oracle).