The Database Seminar Series provides a forum for presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks that are scheduled for 2009-2010 are below.
Unless otherwise noted, all talks will be in room DC (Davis Centre) 1302. Coffee will be served 30 minutes before the talk.
We will try to post the presentation notes, whenever that is possible. Please click on the presentation title to access these notes (usually in pdf format).
Database Seminar Series is supported by iAnywhere Solutions, A Sybase Company.
|Dirk Van Gucht|
|Title:||Engineering the Billion-Object Authenticated Directory|
|Speaker:||Randal Burns, Johns Hopkins University|
Web 2.0 applications regularly store and access data from a variety of Internet storage services. Such "cloud" storage services have reached remarkable scale in a short period of time: Amazon's S3 contained more than 52 billions objects as of April 2009. Authenticated directories have emerged as the preferred technique to verify the integrity and authenticity of dynamic data. However, existing implementations of authenticated directories employ data structures that exhibit performance problem at scale, making it infeasible to validate data retrieved from the cloud. This talk will describe our ongoing efforts to minimize the I/O required to insert and query an authenticated directory and improve the cache performance of the data structures. The optimizations include the use of incremental cryptography, the replication of authentication metadata, temporal organization of objects in the directory, and the adaptation of fractal trees and buffered repository trees to authenticated directories. When used together, these techniques have already improved the performance of authenticated directories by a factor of five.
|Bio:||Randal Burns is an Associate Professor in the Department of Computer Science in the Whiting School of Engineering at the Johns Hopkins University. His research focuses on the management, performance, and security of large data sets, particularly for scientific and archival applications. His recent results address query processing for data-intensive science applications, remote data checking for outsourced storage services, and adaptive performance management for network data protocols.|
|Title:||Managing Large-scale Probabilistic Databases|
|Speaker:||Christopher Ré, University of Wisconsin|
For the next generation of data-management applications, such as sensor-based monitoring, data integration, and information extraction, data processing is the dominant cost. Often, the data driving these applications are uncertain, for example, due to missed, inconsistent, or imprecise sensor readings. Unfortunately, traditional data-management systems provide little or no support for managing uncertainty. To remedy this, I will discuss an approach for data management in which uncertainty is modeled using probabilities. The cost of modeling imprecision using probabilities is that basic data-management tasks, such as querying, become theoretically and practically more difficult. Thus, the key challenge in managing large-scale probabilistic data is efficiency.
In this talk, I will discuss the fundamental techniques that we used to build a probabilistic database capable of handling large, imprecise datasets: these techniques include top-k processing with probabilities, materialized views, approximate lineage, and extensional processing for complex analytic queries. This work resulted in two systems: Mystiq, the first system to support complex queries on gigabytes of probabilistic relational data, and Lahar, the first system to support rich event-style queries on large probabilistic streams.
|Bio:||Christopher (Chris) Ré's recent work is in the area of probabilistic data management. His work on managing large probabilistic databases is motivated by diverse applications such as RFID data management, information extraction, social networking and data cleaning. His recent work has produced two systems: Mystiq, a system to manage relational probabilistic data, and Lahar, a streaming probabilistic database. The contributions of these systems are techniques to efficiently evaluate queries, such as multisimulation and extensional plans for aggregates, and to efficiently represent probabilistic data using materialized views and approximate lineage. Chris recently completed an internship at Microsoft Research with the Data Management and Exploration group with a project on data cleaning. Prior to beginning his work on probabilistic data management, Chris worked on XML data processing. While visiting IBM, he helped to design the algebraic compiler in Galax (an XQuery processor) and an update language for XQuery, XQuery!. Chris was also lead the development of SilkRoute II, a comprehensive XQuery to SQL translation system. Chris received his PhD from the University of Washington and is currently an assistant professor at the University of Wisconsin-Madison.|
|Title:||Towards a Theory of Dataspace Queries|
|Speaker:||Dirk Van Gucht, Indiana University|
The need to manage diverse information sources has triggered the rise of very loosely structured data models, known as "dataspace models." Such information management systems must allow querying in simple ways, mostly by a form of searching. Motivated by these developments, we propose a theory of search queries in a general model of dataspaces. In this model, a dataspace is a collection of data objects, where each data object is a collection of data items. Basic search queries are expressed using filters on data items, following the basic model of boolean search in information retrieval. We characterize semantically the class of queries that can be expressed by searching. We apply our theory to classical relational databases, where we connect search queries to the known class of fully generic queries, and to dataspaces where data items are formed by attribute-value pairs. We also extend our theory to a more powerful, associative form of searching where one can ask for objects that are similar to objects satisfying given search conditions. Such associative search queries are shown to correspond to a very limited kind of joins. Specifically, we show that the basic search language extended with associative search can done exactly like the queries definable in a restricted fragment of the semi-join algebra working on an explicit relational representation of the dataspace.
(This work was done in collaboration with George Fletcher, Jan Van den Bussche, and Stijn Vansummeren)
|Bio:||Dirk Van Gucht received his PhD from Vanderbilt University in 1985, and he is currently a professor in Computer Science at Indiana University. His interests cover a wide spectrum of database topics, specializing in database theory and data mining. He has published papers in the areas of structured document queries; data mining; reflective or meta query languages; theory of object creation, semi-determinism, computationally complete query languages; object-oriented/graph-oriented query languages; spatial databases, constraint query languages; nested relations, generalized quantifiers; and genetic algorithms.|
|Title:||Towards multi-modal extraction and summarization of conversations (PDF)|
|Speaker:||Raymond Ng, University of British Columbia|
For many business intelligence applications, decision making depends critically on the information contained in all forms of “informal” text documents, such as emails, meeting summaries, attachments and web documents. For example, in a meeting, the topic of developing a new product was first raised. In subsequent follow-up emails, additional comments and discussions were added, which included links to web documents describing similar products in the market and user reviews on those products. A concise summary of this “conversation” is obviously valuable. However, existing technologies are inadequate in at least two fundamental ways. First, extracting “conversations” embedded in multi-genre documents is very challenging. Second, applying existing multi-document summarization techniques, where were designed mainly for formal documents, have proved to be highly ineffective when applied to informal documents like emails.
In this presentation, we give an overview of email summarization and meeting summarization methods. We give short demos on what we have developed so far. We conclude by presenting several open problems that need to be solved for multi-modal extraction and summarization of conversations to become a reality.
|Bio:||Dr. Raymond Ng is a professor in Computer Science at the University of British Columbia. His main research area for the past two decades is on data mining, with a specific focus on health informatics and text mining. He has published over 150 peer-reviewed publications on data clustering, outlier detection, OLAP processing, health informatics and text mining. He is the recipient of two best paper awards - from 2001 ACM SIGKDD conference, which is the premier data mining conference worldwide, and the 2005 ACM SIGMOD conference, which is one of the top database conferences worldwide. He was one of the program co-chairs of the 2009 International conference on Data Engineering, and one of the program co-chairs of the 2002 ACM SIGKDD conference. He was also one of the general co-chairs of the 2008 ACM SIGMOD conference. He was an editorial board member of the Very large Database Journal and the IEEE Transactions on Knowledge and Data Engineering until 2008.|
|Title:||The Orion Uncertain Data Management System|
|Speaker:||Sunil Prabhakar, Purdue University|
Many applications domains are faced with the need to store and manipulate uncertain or imprecise data. Examples include sensor databases, data cleansing, scientific data, and information retrieval systems. Probabilistic modeling of this uncertainty is an attractive option for these applications. In talk we will discuss challenges and some solutions for supporting probabilistic data in relational databases. We will use our experience in developing the Orion uncertain data management system as a concrete example of a potential solution. The challenges extend from the design of probabilistic relational models to performance, and user interfaces. Models are an essential first step with several design choices including the types of uncertainty handled (discrete, continuous, tuple, attribute, etc.) and semantics of the operators. Potential query processing and optimization issues will also be presented.
|Bio:||Sunil Prabhakar is a Professor of Computer Sciences at Purdue University. Dr. Prabhakar's research interests include uncertain data management, moving object and sensor databases, digital rights for databases, large-scale data management, and multimedia databases. Dr. Prabhakar is a recipient of the NSF Career award and is a Purdue University Faculty Scholar. He is a member of the editorial board of the Distributed and Parallel Databases Journal, and has served on the program committees of numerous database conferences including SIGMOD, VLDB, and ICDE. His research has been supported by NSF, AFOSR, Microsoft Corp., IBM Corp., and the Center for Education and Research in Information Assurance and Security (CERIAS) at Purdue University.|
|Title:||DBToaster: Aggressive compilation techniques for online aggregation|
|Speaker:||Christoph Koch, Cornell University|
This talk presents a new compilation technique for turning complex aggregation queries into high-performance native code query processors for update streams. In contrast to today's mainly operator-centric query plan interpreters, DBToaster, our query compiler, aggressively compiles aggregate queries by recursively applying incremental view maintenance ideas, creating a network of data structures that support each other's maintenance under change through extremely simple local modifications. DBToaster is capable of generating incremental engines for a wide range of aggregate queries, including nested aggregate queries for which no incremental maintenance techniques were previously known. Our work is motivated by applications that require the continuous maintenance of aggregation queries on update streams, such as online data warehousing and low-latency algorithmic trading. As we show, our techniques are several orders of magnitude faster than state-of-the-art database and stream processing engines on such workloads. Code produced by DBToaster also admits a surprising form of massive parallelization, allowing for the automated generation of lightweight realtime data warehousing systems.
DBToaster is joint work with Yanif Ahmad and Oliver Kennedy.
|Title:||Exploiting Modern Hardware for Frequent Itemset Mining: The k-Ary Predecessor Trie|
|Speaker:||Wolfgang Lehner, TU Dresden|
|Abstract:||Modern CPUs provide most importantly two ways of parallelism: (1) multiple homogenous cores and (2) SIMD capabilities. The recent development in CPU design brought in addition to an increasing number of cores also wider SIMD registers (e.g. 8-way integer registers for AVX). Using these registers allows processing of instructions on multiple data elements simultaneously. In our current work, we are looking into the potential of SIMD operations for frequent itemset mining. Specifically, we are investigating different physical data structures for the well-known FPGrowth algorithm. Our k-ary predecessor trie seems to be a well-performing implementation of a prefix tree that fully exploits SIMD operations provided by modern processors. Since we are using FPGrowth only as a motivating (and highly relevant) example, our k-ary predecessor trie may be considered a solution for a broader range of applications operating on prefix-tree shaped data structures.|
|Title:||Information Discovery on Vertical Domains|
|Speaker:||Vagelis Hristidis, Florida International University|
As the amount of available data increases, the problem of information discovery, often referred to as finding the needle in the haystack problem, becomes more pressing. The most successful search applications today are the general purpose Web search engines and the well-structured database querying (e.g., SQL).
Directly applying these two search models to specific domains is ineffective since they ignore the domain semantics -meaning of object associations- and the needs of the domain users -a biologist wants to see different results from a physician for the same query on PubMed.
We present challenges and techniques to achieve effective information discovery on vertical domains by modeling the domain semantics and its users, and exploiting the knowledge of domain experts. Our focal domains are products marketplace, biological data, clinical data, and bibliographic data.
This project is being funded by NSF.
|Bio:||Vagelis Hristidis received the BS degree in Electrical and Computer Engineering from the National Technical University of Athens, and the MS and PhD degrees in Computer Science from the University of California, San Diego, in 2004. Since then, he is an Assistant Professor in the School of Computing and Information Sciences at Florida International University, in Miami. Vagelis has received funding from the National Science Foundation, the Department of Homeland Security and the Kauffman Entrepreneurship Center, including the NSF CAREER Award. His main research work addresses the problem of bridging the gap between databases and information retrieval. He has published more than 45 research articles, which have received more than 1,500 references as reported by Google Scholar. He also recently edited and co-authored a book on Information Discovery on Electronic Health Records. For more information, please visit http://www.cis.fiu.edu/~vagelis/.|