Database Seminar Series (2006-2007) | Data Systems Group

The Database Seminar Series provides a forum for presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks that are scheduled for 2006-2007 are below, and more will be listed as we get confirmations. Please send your suggestions to M. Tamer Özsu.

Unless otherwise noted, all talks will be in room DC (Davis Centre) 1304. Coffee will be served 30 minutes before the talk.

We will try to post the presentation notes whenever possible. Please click on the presentation title to access these notes (usually in PDF format).

Database Seminar Series is supported by iAnywhere Solutions, A Sybase Company.

Amol Deshpande

Alex Borgida

Jiawei Han

Johannes Gehrke

Donald Kossmann

Gustavo Alonso

Jignesh Patel

Surajit Chaudhuri

Jun Yang

Ugur Cetintemel

Shivnath Babu

25 September 2006, 11:00 AM

Title:	MauveDB: Managing Uncertain Data using Statistical Models
Speaker:	Amol Deshpande, University of Maryland
Abstract:	Real-world data, especially that generated by distributed measurement infrastructures such as wireless sensor networks, tends to be incomplete, imprecise, and erroneous, and hence rarely usable in its raw form. The traditional approach to dealing with this problem is to first synthesize (filter) such data using a statistical or a probabilistic model, thus resulting in a more robust interpretation of the data. However, current database systems do not provide adequate support for statistical modeling of data, especially when those models need to be frequently updated as new data arrives in the system. Hence most scientists and engineers, who depend on models for managing their data, do not use database systems for archival or querying at all; at best, databases serve as a persistent raw data store. In this talk, I will present our approach to integrating statistical and probabilistic models into database systems, in the context of data management in wireless sensor networks. I will first present a data acquisition approach for wireless sensor networks that demonstrates how models can be used both to provide more meaningful answers to user queries, and to significantly reduce the energy cost of acquiring data from the underlying sensing devices. I will then present our recent work on the "MauveDB" system, which uses an abstraction called "model-based views" to seamlessly integrate models into traditional relational database systems.
Bio:	Amol Deshpande is an Assistant Professor at the University of Maryland at College Park. He received his PhD from UC Berkeley in 2004. His research interests are adaptive query processing, sensor network data management, and statistical modeling of data.

16 October 2006, 10:30 AM (Please note time change)

Title:	Visions of Data Semantics: Another (and another) Look
Speaker:	Alex Borgida, Rutgers University
Abstract:	The problem of data semantics is establishing and maintaining a correspondence between a data source (e.g., a database, an XML document) and its intended subject matter. We review the (relatively minor) role data semantics has played in Databases under the term "semantic data models", its more prominent place in ontology-based information integration, and then outline two new views: (i) Semantics as a composition of mappings between models, and (ii) Attaching intensional aspects (stakeholder goals) to Information Systems. In each case we consider the benefits of this view for the important problem of data integration/loading. Joint work with John Mylopoulos and students (Univ. of Toronto)
Bio:	Alex Borgida holds a PhD degree from University of Toronto, and is a Professor of Computer Science at Rutgers University, New Brunswick, NJ. His research is mainly concerned with knowledge representation and its applications. He has published in a variety of areas including Artificial Intelligence (description logics, explanation), Databases (exceptions, semantic data models, data mapping), Software Engineering (requirements modeling, software specification). The main unifying thread of this work is a belief in the importance of languages, which shape the way we think of the problem (an unabashed Whorfian!), and the need to be precise and logical about the semantics of such languages. Alex is co-recipient of the most influential paper award of the 1994 International Conference on Software Engineering, and is proud to have contributed to the design and implementation of the Classic language/logic, which was used by AT&T as part of a system that configured "billions of dollars' worth of equipment sold".

27 November 2006, 10:30 AM

Title:	Warehousing and Mining Massive RFID Data Sets
Speaker:	Jiawei Han, University of Illinois at Urbana-Champaign
Abstract:	Radio Frequency Identification (RFID) applications are set to play an essential role in object tracking and supply chain management systems. In the near future, it is expected that every major retailer will use RFID systems to track the movement of products from suppliers to warehouses, store backrooms, and eventually to points of sale. The volume of information generated by such systems can be enormous as each individual item (a pallet, a case, or an SKU) will leave a trail of data as it moves through different locations. As a departure from the traditional data cube, we propose a new RFID data warehouse model that preserves object transitions while providing significant compression and path-dependent aggregates, based on the following observations: (1) items usually move together in large groups through early stages in the system (e.g., distribution centers) and only in later stages (e.g., stores) do they move in smaller groups, and (2) although RFID data is registered at the primitive level, data analysis usually takes place at a higher abstraction level. Techniques for summarizing data, query processing in RFID data warehouses, RFID flow-cube construction, and data mining based on this framework are developed. We also illustrate a few promising research topics for mining such massive RFID data warehouses. Besides this technical talk, I will give a short summary of our recent research work on data mining.
Bio:	Jiawei Han is a Professor in the Department of Computer Science, University of Illinois at Urbana-Champaign. He has been working on research into data mining, data warehousing, database systems, data mining from spatiotemporal data, multimedia data, stream and RFID data, social network data, and biological data, with over 300 journal and conference publications. He has chaired or served on over 100 program committees of international conferences and workshops, including as PC co-chair of the 2005 (IEEE) International Conference on Data Mining (ICDM) and Americas Coordinator of the 2006 International Conference on Very Large Data Bases (VLDB). He is also serving as the founding Editor-In-Chief of ACM Transactions on Knowledge Discovery from Data. He is an ACM Fellow and has received the 2004 ACM SIGKDD Innovations Award and the 2005 IEEE Computer Society Technical Achievement Award. His book "Data Mining: Concepts and Techniques" (2nd ed., Morgan Kaufmann, 2006) has been widely used as a textbook worldwide.

16 January 2007, 9:30 AM, DC 1331 (New Room, New Time!)

Title:	Complex Event Processing with Cayuga
Speaker:	Johannes Gehrke, Cornell University
Abstract:	Publish/subscribe (pub/sub) is a powerful paradigm enabling asynchronous interaction in large distributed settings ranging from Enterprise Application Integration and Internet-scale notification services to processing events from RFIDs and monitoring the blogosphere. By limiting subscriptions to simple filters on message topics or content, pub/sub systems achieve great scalability in the number of publishers and subscribers. However, today's pub/sub is unfortunately rather limited; in particular, users cannot express conditions that involve more than a single message. In this talk, I will overview the Cornell Cayuga System, a stateful pub/sub system for complex event processing. Cayuga supports powerful subscription features that include maintenance of state across messages, parameterization, and aggregation. Cayuga compiles subscriptions down to simple finite state automata that can be implemented very efficiently. I will conclude with experimental results and first experiences from several prototype deployments.
Bio:	Johannes Gehrke is an Associate Professor in the Department of Computer Science at Cornell University and an Associate Director of the Cornell Theory Center. Johannes' research interests are in the areas of data mining, search, data privacy, complex event processing, and applications of database and data mining technology to marketing and the sciences. Johannes has received a National Science Foundation Career Award, an Alfred P. Sloan Fellowship, an IBM Faculty Award, the Cornell College of Engineering James and Mary Tien Excellence in Teaching Award, and the Cornell University Provost's Award for Distinguished Scholarship. He is the author of numerous publications on data mining and database systems, and he co-authored the undergraduate textbook Database Management Systems (McGraw-Hill (2002), currently in its third edition), used at universities all over the world. Johannes was co-Chair of the 2003 ACM SIGKDD Cup, Program co-Chair of the 2004 ACM International Conference on Knowledge Discovery and Data Mining (KDD 2004), and he is Program Chair of the 33rd International Conference on Very Large Data Bases (VLDB 2007). At Cornell, Johannes teaches in the Department of Computer Science, the Information Science Program, and in the Johnson Graduate School of Management. He has given courses and tutorials on data mining, data stream processing, and data privacy on Wall Street and all over the world, and he has extensive industry experience as a technical advisor.

5 February 2007, 10:30 AM

Title:	Predicate-based Indexing of Annotated Data
Speaker:	Donald Kossmann, ETH Zürich
Abstract:	In many environments, data is annotated either by humans or by software applications. A prominent example is the tagging of links and Web pages as done by users of, e.g., del.icio.us. Other examples include Office documents (e.g., Word) in which the data is annotated in order to encode layout, versioning, comments, or references to, say, addresses stored in Excel. This talk shows why today's generation of search engines does not support such annotated data well. Furthermore, it shows how today's search engines can be extended. The idea is to extend inverted file indices with an additional column that contains predicates. These predicates encode how to interpret the annotations. The talk demonstrates how this extended approach can improve search on tagged Web pages, desktop search, and enterprise search (i.e., Web-based Java applications). Furthermore, results of preliminary performance experiments (precision and recall) are presented. Joint work with Cristian Duda.
Bio:	Donald Kossmann is a professor of Computer Science at ETH Zurich (Switzerland). He received his MS in 1991 from the University of Karlsruhe and completed his PhD in 1995 at the Technical University of Aachen. He is a co-founder of i-TV-T, a German company that develops eProcurement applications. His research interests lie in the area of database and information systems; in particular, Web-based information systems and database applications.

20 February 2007, 2:00 PM, MC 5136 (Please note special time & place)

Title:	SwissQM: A Virtual Machine for Sensor Networks
Speaker:	Gustavo Alonso, ETH Zürich
Abstract:	Sensor networks have become one of the main lines of research in several areas of computer science. The potential for sensor networks is well known and numerous applications have been described and are being explored. A less known fact about wireless sensor networks is that they are very difficult and cumbersome to program, deploy, and get to work in real settings. Recent experience reports confirm the many problems encountered, which are caused both by the nature of the problem and by the lack of appropriate tools and abstractions to build real systems based on sensor networks. In this talk I will give an overview of the typical problems encountered and describe some of the efforts at ETH Zurich to come up with better infrastructures for sensor networks. In particular, I will describe SwissQM, a virtual machine designed to run on the sensors that is not only efficient but also offers the necessary level of abstraction and interface to develop much of the functionality needed to turn sensor networks into real systems. Work done in collaboration with Rene Müller and Donald Kossmann.
Bio:	Gustavo Alonso is a professor in the Department of Computer Science at the Swiss Federal Institute of Technology in Zurich (ETHZ). He holds degrees in Telecommunications Engineering from the Madrid Technical University (1989) and in computer science (M.S. 1992, Ph.D. 1994) from the University of California at Santa Barbara. Before joining ETH Zurich, he was a visiting scientist in the IBM Almaden Research Laboratory in San Jose, California. Currently, Gustavo Alonso leads the Information and Communication Systems Research Group and is the Chair of the Institute for Pervasive Computing. For more information on the activities of the group, please visit www.iks.ethz.ch.

12 March 2007, 9:00 AM (Please note special time)

Title:	Towards Declarative and Efficient Querying on Biological Datasets
Speaker:	Jignesh Patel, University of Michigan
Abstract:	Modern life sciences explorations often need to analyze and manage large volumes of complex biological data. Unfortunately, existing life sciences applications often employ awkward procedural querying methods and use query evaluation algorithms that do not scale as the data size increases. For example, data is often stored in flat files and queries are expressed and evaluated by programs written in Python. The perils of employing such procedural querying methods are well known to a database audience, namely a) severely limiting the ability to rapidly express complex queries, and b) often resulting in very inefficient query plans as sophisticated query optimization and evaluation methods are not employed. The problem is likely to get worse in the future as many life sciences datasets are growing at a rate faster than Moore's Law. Furthermore, the queries that scientists want to pose are also rapidly increasing in their complexity. The focus of this talk is on a database approach to querying biological datasets. I will describe the ongoing work in the Periscope project in which we are developing a system for declarative and efficient querying on genomic and proteomics datasets.
Bio:	Jignesh M. Patel is an Associate Professor at the University of Michigan. He graduated with a PhD from the University of Wisconsin in 1998. Since 1999, he has been a faculty member in the EECS department at the University of Michigan, where his research has focused on bioinformatics, spatial query processing, and XML query processing. He is the recipient of an NSF Career Award and multiple IBM Faculty Awards. He has served on a number of Program Committees including ACM SIGMOD and VLDB, and has served as an Associate Editor for the Systems and Prototype section of ACM SIGMOD Record, a Vice-Chair for IEEE International Conference on Data Engineering 2005, and an Associate Editor for the IEEE Data Engineering Bulletin.

30 April 2007, 10:30 AM

Title:	Self-Managing DBMS Technology: The AutoAdmin Experience
Speaker:	Surajit Chaudhuri, Microsoft Research
Abstract:	The cost of ownership of any commercial database system is significant. The AutoAdmin project at Microsoft Research was initiated (well before the term Autonomic Computing became a buzzword) to develop techniques to reduce the overhead of database administration. Our goal was to make it easier to monitor the server and develop self-tuning techniques for performance management. The technology from this project has been incorporated in Microsoft SQL Server 2005 (and in the earlier releases SQL Server 7.0 and SQL Server 2000). This talk will take a look at some of the past research results and discuss challenges and opportunities in self-tuning DBMS research.
Bio:	Surajit Chaudhuri is a Senior Researcher and leads the Data Management and Exploration Group at Microsoft Research. His areas of interest include self-tuning database systems, query optimization, data cleaning and other tools for data integration, and understanding the synergy between IR and DBMS. Outside of database research, he led the development of CMT, a conference management service hosted by Microsoft Research since 1999 for the academic community. Surajit has a PhD from Stanford University and is an ACM Fellow. He was awarded the SIGMOD Contributions Award in 2004.

14 May 2007, 10:30 AM

Title:	Data-Driven Processing in Sensor Networks
Speaker:	Jun Yang, Duke University
Abstract:	Wireless sensor networks enable data collection from the physical environment on unprecedented scales. In this talk, I will describe some data processing problems that arise in building an environmental sensing network in Duke Forest, in collaboration with ecologists and statisticians. Because of severe resource constraints on battery-powered sensor nodes, it is infeasible to collect and report all raw readings for centralized processing. An effective approach is model-driven data acquisition, which avoids acquiring readings that can be accurately predicted from known spatio-temporal models of data. We argue for an alternative, data-driven approach, which exploits models in optimizing push-based reporting, but does not depend on the quality of models for correctness. A particularly thorny issue with push-based reporting is transmission failures, which are common in sensor networks, and make failed reports indistinguishable from intentionally suppressed ones. The cost of implementing reliable transmissions is prohibitively high. We show how to inject application-level redundancy in data reporting to enable efficient, effective, and principled resolution of uncertainty in the missing data.
Bio:	Jun Yang received his B.A. from University of California at Berkeley in 1995, and his Ph.D. from Stanford University in 2001. He is currently an Assistant Professor of Computer Science at Duke University. He is broadly interested in research on data management, and is currently focusing on derived data maintenance, continuous query systems, and sensor data processing. He is a recipient of the National Science Foundation CAREER Award and the IBM Faculty Award.

25 June 2007, 10:30 AM

Title:	Processing and Routing Streams in a Networked World
Speaker:	Ugur Cetintemel, Brown University
Abstract:	This talk will provide an overview of our recent work on building distributed stream-oriented software infrastructures at Brown. Our goal is to provide robust, scalable software support for an emerging class of applications that require collection, processing and distribution of large volumes of real-time data streams, generated by a number of potentially distributed data sources (such as cameras, weather stations, and network monitors). In particular, the talk will cover the high-level design and the key features of Borealis, a distributed stream processor, and XPORT, a distributed publish/subscribe system. The talk will also highlight our ongoing work on integrating these two systems and some future directions.
Bio:	Ugur Cetintemel is an assistant professor in the Computer Science Department at Brown University. He received a Ph.D. in Computer Science from the University of Maryland, College Park, in 2001. His current work focuses on the architecture and performance of advanced database systems, with an emphasis on streaming data. He is a Brown University Manning Assistant Professor and a recipient of the NSF CAREER Award. He is also a co-founder and a senior architect for StreamBase Inc.

17 August 2007, 2:00 PM (Please note special day and time)

Title:	Experiment-Driven Management of Web Services and Scientific Applications
Speaker:	Shivnath Babu, Duke University
Abstract:	Database-backed Web services (e.g., Amazon, eBay, Yahoo!) play an important role in our daily lives. The performance P (e.g., throughput) of a Web service S is a complex function of its workload W, resource allocation R, and the large number of configuration parameters C that affect S. Furthermore, P may be dictated by unknown interactions among W, R, and C. We have developed a systematic approach based on statistical design of experiments and active machine-learning to discover these dependencies and interactions accurately and comprehensively. Our approach plans a small set of experiments, where each experiment observes P for a selected combination. In this talk, I will describe how we use the experiment-driven approach to process four basic queries in Web-service management; a harness that leverages virtualization to conduct experiments with specified combinations; and an empirical evaluation using two multitier Web services that demonstrates the feasibility and usefulness of our approach. I will conclude by describing how we applied the same experiment-driven approach to manage scientific applications in a utility computing setting.
Bio:	Shivnath Babu is an Assistant Professor of Computer Science at Duke University. He received his Ph.D. from Stanford University in 2005. He was awarded a National Science Foundation Early CAREER Award in 2007 for his work on the Ques project on Querying and Controlling Systems. He is also the recipient of two IBM Faculty Awards. His current research focuses on making large-scale databases and systems easier to manage.