The Data Systems Seminar Series provides a forum for presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks that are scheduled for this year are listed below.
The talks are usually held on a Monday at 10:30 am in room DC 1302. Exceptions are flagged.
We will try to post presentation notes whenever possible. Please click on a presentation title to access them.
The Database Seminar Series is supported by
|Speaker:||Guoliang Li, Tsinghua University|
|Abstract:||In the big data era, database systems face three challenges. First, traditional heuristics-based optimization techniques (e.g., cost estimation, join order selection, knob tuning) cannot meet the performance requirements of large-scale data, varied applications, and diversified data. We can design learning-based techniques to make databases more intelligent. Second, many database applications require AI algorithms, e.g., image search inside the database. We can embed AI algorithms into the database, utilize database techniques to accelerate AI algorithms, and provide AI capability inside databases. Third, traditional databases target general-purpose hardware (e.g., CPUs) and cannot fully utilize new hardware (e.g., ARM, AI chips). Moreover, besides the relational model, we can utilize the tensor model to accelerate AI operations. Thus, we need to design new techniques to make full use of new hardware.
To address these challenges, we design an AI-native database. On one hand, we integrate AI techniques into databases to provide self-configuring, self-optimizing, self-healing, self-protecting, and self-inspecting capabilities for databases. On the other hand, we enable databases to provide AI capabilities using declarative languages, in order to lower the barrier to using AI.
In this talk, I will introduce the five levels of AI-native databases and discuss the open challenges of designing one. I will also take automatic database knob tuning, a deep reinforcement learning-based optimizer, machine learning-based cardinality estimation, and an automatic index/view advisor as examples to showcase the advantages of AI-native databases.
|Bio:||Guoliang Li is a tenured full professor in the Department of Computer Science, Tsinghua University, Beijing, China. His research interests include AI-native databases, big data analytics and mining, crowdsourced data management, big spatio-temporal data analytics, and large-scale data cleaning and integration. He has published more than 100 papers in premier conferences and journals, such as SIGMOD, VLDB, ICDE, SIGKDD, SIGIR, TODS, VLDB Journal, and TKDE. He was a PC co-chair of DASFAA 2019, WAIM 2014, WebDB 2014, and NDBC 2016. He serves as an associate editor for IEEE Transactions on Knowledge and Data Engineering, the VLDB Journal, ACM Transactions on Data Science, and the IEEE Data Engineering Bulletin. He has regularly served as a (senior) PC member of many premier conferences, such as SIGMOD, VLDB, KDD, ICDE, WWW, IJCAI, and AAAI. His papers have been cited more than 6,000 times. He has received several best paper awards at top conferences, including the CIKM 2017 Best Paper Award, an ICDE 2018 best paper candidate, a KDD 2018 best paper candidate, the DASFAA 2014 Best Paper Runner-up, and the APWeb 2014 Best Paper Award. He has also received the VLDB Early Research Contribution Award (2017), the IEEE TCDE Early Career Award (2014), the National Youth Talent Support Program (2017), ChangJiang Young Scholar (2016), the NSFC Excellent Young Scholars Award (2014), and CCF Young Scientist (2014).
|Title:||Connected data: pushing the envelope of data analytics systems|
|Speaker:||Marco Serafini, University of Massachusetts Amherst|
|Abstract:||Many advanced data science applications, from social networks to knowledge bases and data integration, analyze complex, high-dimensional, connected data, which is often modeled as a graph. Rather than flattening out connections into a tabular form, these applications treat them as first-class citizens. Applications that mine and navigate connected data push the envelope of traditional data analytics systems, both relational and graph-native, in similar ways. This talk will argue that better system support for connected data ultimately benefits both graph and relational analytics. It will discuss some of these dimensions at different levels of the system stack: from storage systems to large-scale cloud execution platforms, from data analysis algorithms that efficiently deal with large intermediate results to new high-level APIs for emerging applications.|
|Bio:||Marco Serafini is an Assistant Professor in the College of Information and Computer Sciences at the University of Massachusetts Amherst. His research interests are in data management systems and distributed systems. His work has impacted popular open-source systems such as Apache ZooKeeper and Apache Storm. He designed Arabesque, a system for distributed graph mining. Before joining UMass, he worked at the Qatar Computing Research Institute and Yahoo! Research.|
|Title:||The Search for Emotions, Creativity, and Fairness in Language|
|Speaker:||Saif M. Mohammad, National Research Council of Canada|
|Abstract:||Emotions are central to human experience, creativity, and behavior. They are crucial for organizing meaning and reasoning about the world we live in. They are ubiquitous and everyday, yet complex and nuanced. In this talk, I will describe our work on the search for emotions in language -- by humans (through data annotation projects) and by machines (in automatic emotion detection systems). I will outline ways in which emotions can be represented, challenges in obtaining reliable annotations, and approaches that lead to high-quality annotations. The lexicons thus created have entries for tens of thousands of terms. They provide fine-grained scores for basic emotions as well as for valence, arousal, and dominance (argued by some to be the core dimensions of meaning). They have wide-ranging applications in natural language processing, psychology, social sciences, digital humanities, and computational creativity. I will highlight some of the applications we have explored in literary analysis and automatic text-based music generation. I will also discuss new sentiment analysis tasks such as inferring fine-grained emotion intensity and stance from tweets, as well as detecting emotions evoked by art. I will conclude with work on quantifying biases in the way language is used and the impact of such biases on automatic emotion detection systems. From social media to home assistants, from privacy concerns to neuro-cognitive persuasion, never has natural language processing been more influential, more fraught with controversy, and more entrenched in everyday life. Thus as a community, we are uniquely positioned to make substantial impact by building applications that are not only compelling and creative but also facilitators of social equity and fairness.|
|Bio:||Dr. Saif M. Mohammad is a Senior Research Scientist at the National Research Council Canada (NRC). He received his Ph.D. in Computer Science from the University of Toronto. Before joining NRC, Saif was a Research Associate at the Institute for Advanced Computer Studies at the University of Maryland, College Park. His research interests are in Emotion and Sentiment Analysis, Computational Creativity, Psycholinguistics, Fairness in Language, and Information Visualization. Saif has served as General Chair for the Canada--UK Symposium on Ethics in AI, co-chair of SemEval (the largest platform for semantic evaluations), and co-organizer of WASSA (a sentiment analysis workshop). He has also served as an area chair for ACL, NAACL, and EMNLP in Sentiment Analysis and in Fairness and Bias in NLP. His work on emotions has garnered media attention, with articles in Time, the Washington Post, Slashdot, LiveScience, The Physics arXiv Blog, PC World, Popular Science, and elsewhere.|
|Speaker:||Umar Farooq Minhas, Microsoft Research|
|Abstract:||Machine learning is transforming database systems research. For example, recent work on “learned indexes” has changed the way we look at the decades-old field of DBMS indexing. The key idea is that indexes can be thought of as “models” that predict the position of a key in a dataset; indexes can, thus, be learned. The original work by Kraska et al. shows that a learned index beats a B+Tree by a factor of up to three in search time and by an order of magnitude in memory footprint; however, it is limited to static, read-only workloads.
In this talk, I will present a new learned index called ALEX, which addresses practical issues that arise when implementing dynamic, updatable learned indexes. ALEX effectively combines the core insights from learned indexes with proven techniques used in B+Trees to achieve high performance and a low memory footprint. I will present the design and implementation of ALEX, along with detailed experiments showing that ALEX not only beats the B+Tree on all workloads but also beats the original Learned Index on read-only workloads. We believe ALEX represents a key step towards making learned indexes practical for a broader class of database workloads with dynamic updates.
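The predict-then-correct idea behind learned indexes can be sketched in a few lines. This is a hedged illustration only, not ALEX itself (ALEX uses a tree of models over gapped arrays to support updates); the class and method names below are made up for the sketch.

```python
# Minimal sketch of a learned index over a static sorted array:
# a linear model predicts the position of a key, and a bounded
# local search corrects any prediction error.
import bisect

class LinearLearnedIndex:
    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Least-squares fit of position ~ slope * key + intercept.
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(self.keys))
        var = sum((k - mean_k) ** 2 for k in self.keys)
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Maximum prediction error bounds the correction search.
        self.err = max(abs(p - self._predict(k)) for p, k in enumerate(self.keys))

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return the position of key in the sorted array, or None."""
        guess = self._predict(key)
        lo = max(0, guess - self.err)
        hi = min(len(self.keys), guess + self.err + 1)
        pos = bisect.bisect_left(self.keys, key, lo, hi)
        if pos < len(self.keys) and self.keys[pos] == key:
            return pos
        return None
```

The memory argument follows directly: the "index" here is two floats and an error bound, versus the inner nodes of a B+Tree. The hard part, and the subject of the talk, is keeping such models accurate under inserts and deletes.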
|Bio:||Umar Farooq Minhas is currently a Principal Researcher in the Database Group at Microsoft Research and specializes in the systems aspects of database management and big data analytics platforms. His current research interests include exploiting machine learning to improve database systems, cloud-based database systems, novel distributed programming frameworks, next-gen virtualization (Docker & Kubernetes), and performance benchmarking. Umar also works closely with product teams in the Azure Data organization, which is responsible for all data management offerings from Microsoft.
Before joining Microsoft Research, Umar worked as a Research Staff Member at the IBM Almaden Research Center where he was co-leading various efforts around big data storage, scheduling, resource provisioning, next generation platforms, and IBM Watson services. His research ideas have been commercialized in IBM Big SQL, a SQL-on-Hadoop platform, and in IBM General Parallel File System (GPFS), a highly scalable, distributed file system.
Umar received a PhD and a Master of Mathematics in Computer Science from the David R. Cheriton School of Computer Science at the University of Waterloo, and a Bachelor of Science in Computer Science from the National University of Computer and Emerging Sciences (Islamabad, Pakistan).
|Speaker:||Juan Sequeda, data.world|
|Abstract:||Data integration has been an active area of computer science research for over two decades. A modern manifestation is the knowledge graph, which integrates not just data but also knowledge at scale. Tasks such as domain modeling and schema/ontology matching are fundamental to the data integration process. Research has focused on studying the data integration phenomenon from a technical point of view (algorithms and systems), with the ultimate goal of automating the task.
In the process of applying scientific results to real-world enterprise data integration scenarios to design and build knowledge graphs, we have experienced numerous obstacles. In this talk, I will share insights about these obstacles. I will argue that we need to think outside of a technical box and further study the phenomenon of data integration through a human-centric lens: from a socio-technical point of view.
|Bio:||Juan F. Sequeda is the Principal Scientist at data.world. He joined through the acquisition of Capsenta, a company he founded as a spin-off from his research. He holds a PhD in Computer Science from The University of Texas at Austin.
Juan is the recipient of an NSF Graduate Research Fellowship, 2nd place in the 2013 Semantic Web Challenge for his work on ConstituteProject.org, the Best Student Research Paper award at the 2014 International Semantic Web Conference, and the 2015 Best Transfer and Innovation Project award from the Institute for Applied Informatics. Juan is on the Editorial Board of the Journal of Web Semantics and a member of multiple program committees (ISWC, ESWC, WWW, AAAI, IJCAI). He was the General Chair of AMW 2018, PC chair of the ISWC 2017 In-Use track, and co-creator of the COLD workshop (co-located with ISWC for 7 years). He has served as a bridge between academia and industry as the current chair of the Property Graph Schema Working Group, a member of the Graph Query Languages task force of the Linked Data Benchmark Council (LDBC), and a past invited expert member and standards editor at the World Wide Web Consortium (W3C).
Wearing his scientific hat, Juan's goal is to reliably create knowledge from inscrutable data. His research interests are at the intersection of logic and data for (ontology-based) data integration and semantic/graph data management, and what is now called knowledge graphs.
Wearing his business hat, Juan does product management, business development and strategy, and technical sales, and works with customers to understand their problems and translate them back to R&D.
|Speaker:||George Fletcher, Eindhoven University of Technology|
|Bio:||George Fletcher (PhD, Indiana University Bloomington) is an associate professor of computer science and chair of the Database Group at Eindhoven University of Technology, the Netherlands. His research interests span query language design and engineering, foundations of databases, and data integration. His current focus is on management of massive graphs such as social networks and knowledge graphs. He was a co-organizer of the EDBT Summer School on Graph Data Management (2015) and is currently a member of the LDBC Graph Query Language Standardization Task Force and Property Graph Schema Language Standardization Task Force. His other recent activities include co-organizing an NII Shonan seminar on Graph Database Systems (2018), chairing the Demo PC for EDBT 2020, and serving on the program committees of SIGMOD, VLDB, ISWC, ICDE, and IJCAI.|
|Speaker:||Peter Boncz, CWI|
|Bio:||Peter Boncz holds appointments as a tenured researcher at CWI and professor at VU University Amsterdam. His academic background is in core database architecture, with the MonetDB system the outcome of his PhD work -- MonetDB much later won the 2016 ACM SIGMOD Systems Award. He has a track record of bridging the gap between academia and commercial application, receiving the Dutch ICT Regie Award 2006 for his role in the CWI spin-off company Data Distilleries. In 2008 he co-founded Vectorwise around the analytical database system of the same name, which pioneered vectorized query execution and was later acquired by Actian. He is a co-recipient of the 2009 VLDB 10-Year Best Paper Award, and in 2013 received the Humboldt Research Award for his research on database architecture. He also works on graph data management, founding in 2013 the Linked Data Benchmark Council (LDBC), a benchmarking organization for graph database systems.|
|Title:||The Evolving Challenges of Media Forensics in a GAN World|
|Speaker:||David Doermann, University at Buffalo|
|Abstract:||The computer vision community has created a technology that, unfortunately, is getting more bad press than good. In 2014, the first GANs paper was able to automatically generate very low-resolution faces of people who never existed, from a random latent distribution. Although the technology was impressive because it was automated, it was nowhere near as good as what could be done with a simple photo editor. In the same year, DARPA started the Media Forensics program to combat the proliferation of edited images and video being generated by our adversaries. Although DARPA envisioned the development of automated technologies, no one thought they would evolve so fast. Five years later, the technology has progressed to the point where even a novice can modify full videos, i.e., DeepFakes, and generate new content of people and scenes that never existed, overnight, using commodity hardware. Recently, the US government has become increasingly concerned about the real dangers of “DeepFake” technologies from both a national security and a misinformation point of view. To this end, it is important for academia, industry, and government to come together to apply technologies, develop policies that put pressure on service providers, and educate the public before we reach the point where “seeing is believing” is a thing of the past. In this talk I will cover some of the primary efforts in applying counter-manipulation detection technology and the challenges we face with current policy in the United States. While technological solutions are still a number of years away, we need a comprehensive approach to deal with this problem.|
|Bio:||Dr. David Doermann is a Professor of Empire Innovation and the Director of the Artificial Intelligence Institute at the University at Buffalo (UB). Prior to coming to UB, he was a Program Manager with the Information Innovation Office at the Defense Advanced Research Projects Agency (DARPA), where he developed, selected, and oversaw research and transition funding in the areas of computer vision, human language technologies, and voice analytics. From 1993 to 2018, David was a member of the research faculty at the University of Maryland, College Park. In his role at the Institute for Advanced Computer Studies, he served as Director of the Laboratory for Language and Media Processing and as an adjunct member of the graduate faculty for the Department of Computer Science and the Department of Electrical and Computer Engineering. He and his group of researchers have focused on many innovative topics related to the analysis and processing of document images and video, including triage, visual indexing and retrieval, and enhancement and recognition of both textual and structural components of visual media. David has over 250 publications in conferences and journals, is a fellow of the IEEE and IAPR, has received numerous awards including an honorary doctorate from the University of Oulu, Finland, and is a founding Editor-in-Chief of the International Journal on Document Analysis and Recognition.|
|Title:||Relational Artificial Intelligence|
|Speaker:||Molham Aref, RelationalAI|
|Abstract:||In this talk, I will make the case for a first-principles approach to machine learning over relational databases that exploits recent developments in database systems and theory. The input to learning classification and regression models is defined by feature extraction queries over relational databases. The mainstream approach to learning over relational data is to materialize the result of the feature extraction query, export it out of the database, and then learn over it using statistical software packages. These three steps are expensive and unnecessary. Instead, one can cast the machine learning problem as a database problem, keeping the feature extraction query unmaterialized and using a new generation of meta-algorithms to push the learning through the query. The performance of this approach benefits tremendously from structural properties of the relational data and of the feature extraction query; such properties may be algebraic (semi-ring), combinatorial (hypertree width), or statistical (sampling). Performance is further improved by leveraging recent advances in compiler technology that eliminate the cost of abstraction and allow us to specialize the computation for specific workloads and datasets. This translates to several orders-of-magnitude speed-ups over state-of-the-art systems.
This work is done by my colleagues at RelationalAI and by members of our faculty research network, including Dan Olteanu (Oxford), Maximilian Schleich (Oxford), Ben Moseley (CMU), and XuanLong Nguyen (Michigan).
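The "push the learning through the query" idea can be illustrated with the simplest possible aggregate. The sketch below (hedged: the function names are made up, and this is not RelationalAI's system) computes SUM(a*b) over a join of two relations without ever materializing the join, exploiting the kind of algebraic (semiring) structure the abstract mentions; sums of products like this are exactly what the gram-matrix computations behind linear regression need.

```python
# Two relations with a shared join key x: R(x, a) and S(x, b),
# represented as lists of (x, value) pairs.
from collections import defaultdict

def sum_product_materialized(R, S):
    """Naive: enumerate the join, then aggregate -- cost grows with
    the (possibly huge) join result."""
    return sum(a * b for x1, a in R for x2, b in S if x1 == x2)

def sum_product_factorized(R, S):
    """Push SUM(a*b) through the join: because multiplication
    distributes over addition,
        sum over the join of a*b  =  sum over x of (sum_R a) * (sum_S b),
    so per-relation partial sums suffice -- cost O(|R| + |S|)."""
    sa = defaultdict(float)
    sb = defaultdict(float)
    for x, a in R:
        sa[x] += a
    for x, b in S:
        sb[x] += b
    return sum(sa[x] * sb[x] for x in sa.keys() & sb.keys())
```

Both functions return the same number; only the second avoids the intermediate join result, which is the source of the speed-ups the talk describes.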
|Bio:||Molham Aref is the Chief Executive Officer of RelationalAI. He has more than 28 years of experience in leading organisations that develop and implement high value machine learning and artificial intelligence solutions across various industries. Prior to RelationalAI he was CEO of LogicBlox and Predictix (now Infor), Optimi (now Ericsson), and co-founder of Brickstream (now FLIR). Molham held senior leadership positions at HNC Software (now FICO) and Retek (now Oracle).|