Database Seminar Series (2007-2008)

The Database Seminar Series provides a forum for the presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks scheduled for 2007-2008 are listed below.

Unless otherwise noted, all talks will be held in room DC 1304 (Davis Centre). Coffee will be served 30 minutes before each talk.

We will post presentation notes whenever possible. Please click on a presentation title to access its notes (usually in PDF format).


The Database Seminar Series is supported by iAnywhere Solutions, A Sybase Company.


Samuel Madden
Goetz Graefe
Remzi Arpaci-Dusseau
Christopher Jermaine
Ahmed Elmagarmid
Neoklis Polyzotis
Denilson Barbosa
Amr El Abbadi
Carl-Christian Kanne
Eric Lo

10 September 2007, 10:30 AM

Title: The CarTel Mobile Sensor System
Speaker: Samuel Madden, MIT
Abstract:

CarTel is a software and hardware platform for collecting and processing data from moving vehicles.  Data is captured from GPS, Wi-Fi, and OBD-II interfaces and stored in a local database on each car.  This stored data is transferred opportunistically, via available Wi-Fi networks, cellular modems, or by "muling" data on a user's cell phone or USB key, to a central "portal", where users can browse and visualize it.  To allow non-expert users to specify what data they would like to collect from remote vehicles, CarTel includes a simple database-like interface for programming and configuration.

In this talk, I will discuss the architecture of the CarTel data management system and mention some applications of the technology to traffic and fleet management, driving safety, and fuel conservation. I will also show a brief demo of the portal software, which provides a map-based interface for browsing about 1100 hours of driving data collected by our system from six cars over the past year.
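
The store-locally, deliver-when-possible pattern at the heart of CarTel can be sketched in a few lines of Python. The class and method names below are hypothetical, not CarTel's actual API; this is only a minimal illustration of opportunistic delivery.

    import sqlite3
    import time

    class CarNode:
        """Buffers sensor readings in a local database and flushes them
        whenever some delivery channel (Wi-Fi, cellular, a "muled" USB
        key) happens to be available. (Hypothetical API.)"""

        def __init__(self, path=":memory:"):
            self.db = sqlite3.connect(path)
            self.db.execute("CREATE TABLE IF NOT EXISTS readings "
                            "(ts REAL, sensor TEXT, value REAL, "
                            " delivered INTEGER DEFAULT 0)")

        def record(self, sensor, value):
            # Every reading is stored locally first; delivery happens later.
            self.db.execute("INSERT INTO readings (ts, sensor, value) "
                            "VALUES (?, ?, ?)", (time.time(), sensor, value))
            self.db.commit()

        def flush(self, send):
            # Called when connectivity appears; 'send' ships a batch upstream.
            rows = self.db.execute("SELECT rowid, ts, sensor, value "
                                   "FROM readings WHERE delivered = 0").fetchall()
            if rows and send([r[1:] for r in rows]):
                self.db.executemany("UPDATE readings SET delivered = 1 "
                                    "WHERE rowid = ?", [(r[0],) for r in rows])
                self.db.commit()

    node = CarNode()
    node.record("gps_speed_kmh", 57.2)
    node.flush(send=lambda batch: True)   # stand-in for an actual upload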

Bio: Sam Madden is an associate professor in the EECS department at MIT and a member of MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL). His research interests span all areas of database systems; past projects include the TinyDB system for data collection from sensor networks and the Telegraph adaptive query processing engine. His current research focuses on modeling and statistical techniques for value prediction and outlier detection in sensor networks, high performance database systems, and networking and data processing in disconnected environments.  He received his Ph.D. from UC Berkeley in 2004.

15 October 2007, 10:30 AM

Title: Sorting and indexing with partitioned B-trees
Speaker: Goetz Graefe, HP Labs
Abstract: Even after decades of research and industrial development in B-tree indexes and sorting algorithms, some fairly simple yet effective improvements remain possible. This talk describes one of them: partitioning within a B-tree based on an artificial leading key column. It has significant impact on multiple important database operations. First, during external merge sort, partitioned B-trees enable accurate deep read-ahead and dynamic allocation of memory and threads. Second, during index creation, partitioned B-trees reduce the delay until the first query can search the new index. Third, during data loading, partitioned B-trees permit very fast integration of new data into a fully indexed database. Finally, partitioned B-trees enable or speed up several other operations, including indexing of real-time data streams. By exploiting the duality of B-trees and external merge sort, partitioned B-trees simplify and unify data structures and algorithms, yet offer these benefits using only techniques already supported in modern relational database management systems.
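
As a rough sketch of the core idea, the toy Python fragment below simulates partitions inside a single ordered key space by prefixing every key with an artificial partition number; it is illustrative only, not an actual B-tree implementation.

    from bisect import insort

    class PartitionedIndex:
        """One ordered key space standing in for a B-tree; the artificial
        leading column 'partition' carves it into independent runs."""

        def __init__(self):
            self.entries = []                  # sorted (partition, key) pairs

        def add_run(self, partition, sorted_keys):
            # A sorted run occupies a contiguous range under its partition
            # number, so loading it never disturbs the other partitions.
            for k in sorted_keys:
                insort(self.entries, (partition, k))

        def merge_all(self):
            # The merge step of external merge sort: rewrite every entry
            # under partition 0, leaving an ordinary fully sorted index.
            self.entries = sorted((0, k) for (_, k) in self.entries)

    idx = PartitionedIndex()
    idx.add_run(1, [3, 7, 9])     # first run, queryable immediately
    idx.add_run(2, [1, 4, 8])     # second run, appended without reorganizing
    idx.merge_all()               # final state: one partition, fully sorted
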
Bio: Goetz Graefe recently joined HP Labs to pursue research into database management in support of HP's renewed interest in database management software and hardware. He spent 12 years in product development at Microsoft after several years in academic research and teaching. His publications can be found in ACM Computing Surveys as well as leading journals and conferences in the database field. They have been honored by an ACM SIGMOD "Test of time" award as well as the inaugural IEEE Data Engineering "Influential paper" award. His research has been adopted by Sequent (now IBM), Informix (now IBM), Texas Instruments, Tandem (now HP), Microsoft, and others.

19 November 2007, 10:30 AM

Title: File Systems Are Broken (And What We're Doing To Fix Them)
Speaker: Remzi Arpaci-Dusseau, University of Wisconsin
Abstract:

In this talk, I will present a summary of our recent research in understanding how disks fail and how file and storage systems handle such failures. Our findings reveal numerous design and implementation problems in a wide range of both open-source and commercial systems; put simply, file systems are broken (at least when it comes to reliability).

I will then present a number of current research directions that we are pursuing in order to build a new generation of robust and reliable storage systems. With more formal analysis and construction techniques, we hope to transform the "art" of file system design and implementation into a more rigorous and careful science.
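
To give a flavor of the defensive techniques robust storage systems employ, here is a generic checksum-and-replica sketch in Python. This is a standard technique shown purely for illustration, not the speaker's system.

    import hashlib

    def write_block(store, key, data):
        # Store a checksum alongside the data so corruption is detectable.
        store[key] = (hashlib.sha256(data).hexdigest(), data)

    def read_block(stores, key):
        # Verify on every read; fall back to a replica instead of silently
        # returning bad data.
        for store in stores:                 # primary first, then replicas
            checksum, data = store[key]
            if hashlib.sha256(data).hexdigest() == checksum:
                return data
        raise IOError(f"block {key!r}: all replicas corrupt")

    primary, replica = {}, {}
    for s in (primary, replica):
        write_block(s, "inode-7", b"important metadata")
    # Inject silent corruption into the primary copy:
    primary["inode-7"] = (primary["inode-7"][0], b"bit-rotted junk")
    print(read_block([primary, replica], "inode-7"))   # recovered from replica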

Bio: Remzi Arpaci-Dusseau is an associate professor of Computer Sciences at the University of Wisconsin, Madison. He received his undergraduate degree from the University of Michigan and his Master's and Ph.D. from the University of California, Berkeley. Remzi and his wife and colleague Andrea Arpaci-Dusseau have spent the past 10 years working on I/O and storage issues, with a recent focus on reliability. Details of their work can be found at: http://www.cs.wisc.edu/adsl.

10 December 2007, 10:30 AM

Title: Supporting Scalable Online Statistical Processing
Speaker: Christopher Jermaine, University of Florida
Abstract:

Data warehousing and analytic processing have been active areas of database research and development for nearly two decades. Many experts now consider these topics to be "solved", especially with regard to performance. However, an argument can be made that users and databases have simply reached an uneasy truce with regard to analytic processing. If users avoid ad-hoc, exploratory queries that might take days to execute, then of course the database performs just fine.

In this talk, I will describe query processing in a database system called DBO that is designed from the ground up to support interactive analytic processing. DBO can run database queries from start to finish and produce exact answers in a scalable fashion. However, unlike any existing research or production system, DBO is able to produce statistically meaningful approximate answers at all times throughout query execution. These answers are continuously updated from start to finish, even for "huge" queries requiring arbitrary quantities of temporary secondary storage. Thus, a user can stop execution whenever satisfied with the query accuracy, which may translate to dramatic time savings during exploratory processing.
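
To give a feel for the user experience DBO targets, here is a minimal Python sketch of online aggregation over a single table: tuples are scanned in random order and a running estimate with a rough confidence interval is reported throughout. DBO's actual estimators cover joins and are considerably more sophisticated; everything below is illustrative only.

    import math
    import random

    def online_sum(values, report_every=1000):
        values = list(values)
        random.shuffle(values)            # a random order makes every prefix a sample
        n = len(values)
        total = sumsq = 0.0
        for i, v in enumerate(values, 1):
            total += v
            sumsq += v * v
            if i % report_every == 0 or i == n:
                mean = total / i
                var = max(sumsq / i - mean * mean, 0.0)
                est = n * mean                         # scale sample mean to the table
                half = 1.96 * n * math.sqrt(var / i)   # ~95% normal-approx interval
                # (A finite-population correction, omitted here, would shrink
                # the interval to zero once the full table has been scanned.)
                yield i, est, half

    for seen, est, half in online_sum(range(100000), report_every=20000):
        print(f"after {seen} tuples: SUM ~= {est:.0f} +/- {half:.0f}")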

Bio: Chris Jermaine is an assistant professor of computer science at the University of Florida. His research is generally concerned with databases and data analysis. With his students and colleagues at UF he has published papers in forums such as ACM TODS, VLDB, SIGMOD, and ICDE. He is the recipient of an NSF CAREER award for his research on approximate query processing.

14 January 2008, 10:30 AM

Title: Cyber Communities: Enabling Innovation in Science and Engineering
Speaker: Ahmed Elmagarmid, Purdue University
Abstract: Cyber Infrastructure is the chief enabler for innovation in many of today's most pressing Science and Engineering problems. There are in fact complete areas in Science, such as Systems Biology, that would not exist if not for a flourishing computing and information discipline. In this presentation we will introduce cyber infrastructure (CI), its chief components, and its transition from an enhancer of discovery to an enabler. We will then introduce Cyber Communities as a viable way to build new CI for domain sciences. We will show, through examples and demonstrations, a few existing projects in the Cyber Center at Purdue where high performance computing, hub technologies, data analytics, and visualization all come together to solve specific problems.
Bio: Professor Elmagarmid is the director of the Indiana Center for Database Systems and the Cyber Center in Discovery Park. He received a Presidential Young Investigator award from the National Science Foundation, and distinguished alumni awards from Ohio State University and the University of Dayton in 1993 and 1995, respectively. His research work has focused on database systems. He has done more than his share in serving the database community through conferences, journals, and professional societies.

Professor Elmagarmid was a chief scientist in the Office of Strategy and Technology at Hewlett-Packard (HP). While at HP, he was responsible for software strategy coming out of the corporate CTO office and contributed to cross company roadmap initiatives. He also served on the technology council for HP. Professor Elmagarmid serves as a consultant in the areas of computing technology, the role of technology in societal development and strategic planning. He has several international clients in Qatar and Italy. Within North America, he has worked for Telcordia Technology, Bellcore, IBM, CSC, Harris, D. H. Brown and Associates, MCC, Bell Northern Research, Molecular Design Labs, King & Spalding LLP, and UniSql to name a few.

11 February 2008, 2:30 PM (Please note change of time)

Title: Depth Estimation for Ranking Query Optimization
Speaker: Neoklis Polyzotis, University of California - Santa Cruz
Abstract:

A relational ranking query uses a scoring function to limit the results of a conventional query to a small number of the most relevant answers. The increasing popularity of this query paradigm has led to the introduction of specialized rank join operators that integrate the selection of top tuples with join processing. The number of input tuples that an operator accesses is called the input depth of the operator, and this is the driving cost factor in rank join processing. This introduces the problem of depth estimation, which is crucial for the costing of rank join operators during query compilation and thus for their integration in optimized physical plans.

This talk will provide an overview of our recent work on DEEP, an estimation methodology for approximating the depths of rank join operators in a physical plan. DEEP relies on a general, principled framework in order to formalize depth computation in terms of the joint distribution of scores in the base tables. This results in a systematic estimation methodology that takes the characteristics of the data directly into account and thus enables more accurate estimates. The talk reviews the main technical points of our work and presents experimental results that validate the benefits of the proposed technique over existing methods.
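
The quantity being estimated can be made concrete with a small simulation. The Python sketch below runs a simplified top-k rank join (a threshold scheme in the style of HRJN, with result score taken as the sum of the two input scores) and reports how deep each sorted input was read. DEEP's contribution is estimating these depths analytically from the joint score distribution rather than by executing the join, so the code is purely illustrative.

    import random

    def rank_join_depth(R, S, k):
        # R, S: lists of (join_key, score), each sorted by score descending.
        results, seen_r, seen_s = [], {}, {}
        i = j = 0
        while i < len(R) or j < len(S):
            if j >= len(S) or (i < len(R) and i <= j):
                key, sc = R[i]; i += 1
                seen_r.setdefault(key, []).append(sc)
                results += [sc + s for s in seen_s.get(key, [])]
            else:
                key, sc = S[j]; j += 1
                seen_s.setdefault(key, []).append(sc)
                results += [s + sc for s in seen_r.get(key, [])]
            # Upper bound on the score of any join result not yet produced:
            # every future result uses an unseen tuple from at least one side.
            last_r = R[i - 1][1] if i else R[0][1]
            last_s = S[j - 1][1] if j else S[0][1]
            threshold = max(R[0][1] + last_s, last_r + S[0][1])
            results.sort(reverse=True)
            if len(results) >= k and results[k - 1] >= threshold:
                return i, j          # input depths: tuples consumed from R and S
        return i, j

    random.seed(1)
    mk = lambda: sorted(((random.randrange(50), random.random())
                         for _ in range(1000)), key=lambda t: -t[1])
    R, S = mk(), mk()
    print(rank_join_depth(R, S, k=10))   # depths reached, well short of full scans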

Bio: Alkis Polyzotis is currently an assistant professor at UC Santa Cruz. His research focuses on database systems, and in particular on approximate query answering, on-line database tuning, and P2P databases. He received an NSF CAREER award in 2004 and IBM Faculty Awards in 2005 and 2006. He received his PhD from the University of Wisconsin at Madison in 2003.

31 March 2008, 10:30 AM

Title: Towards Schema-Free Exchange and Update of XML Data
Speaker: Denilson Barbosa, University of Calgary
Abstract: We consider the problem of updating XML data in the context of casual, non-expert users exchanging data (e.g., through Web data sharing services) with limited or no schema knowledge. We introduce a novel paradigm for updating XML data based on simple yet powerful update operations. In particular, we propose effective methods for translating data from one representation into another and for determining the appropriate locations for effecting the updates without violating the schemas of the data sources. For concreteness, we discuss an intuitive update language that frees the user from specific schema knowledge and can be implemented with our framework. Moreover, our proposal is much simpler than current XML update languages and, as such, is appropriate for non-expert users. We discuss alternative semantics for the update operations as well as efficient algorithms for implementing them. We present an experimental analysis with real XML data from several domains, showing that our method is efficient, highly effective, and accurate.
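
A toy example of what "schema-free" can mean in practice: the user names only an element to insert and a value to anchor it, and the system locates a suitable parent. The operation below is hypothetical and merely suggestive of the paradigm, not the language discussed in the talk.

    import xml.etree.ElementTree as ET

    doc = ET.fromstring(
        "<people><person><name>Ada</name></person>"
        "<person><name>Bob</name></person></people>")

    def insert_near(root, match_tag, match_text, new_tag, new_text):
        # Attach the new element to the parent of whichever node matches,
        # sparing the user from writing a full structural path.
        for parent in root.iter():
            for child in list(parent):
                if child.tag == match_tag and child.text == match_text:
                    ET.SubElement(parent, new_tag).text = new_text
                    return True
        return False

    insert_near(doc, "name", "Ada", "phone", "555-0100")
    print(ET.tostring(doc, encoding="unicode"))
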
Bio: Denilson Barbosa has been an Assistant Professor at the University of Calgary since 2005, which he joined after receiving a PhD from the University of Toronto. His research interests are in databases and the Web; he has worked on the storage, update, and consistency checking of XML data, as well as on mappings between disparate data formats. He has also worked on autonomic data management tools. Denilson is the recipient of an IBM Faculty Award and the Alberta Ingenuity Fund New Faculty Award.

2 June 2008, 10:30 AM

Title: New Challenges in the Management of Data
Speaker: Amr El Abbadi, University of California, Santa Barbara
Abstract: Data management has evolved significantly over the last decade. A traditional database managed well-defined, structured data, answered precise queries, and stored the data in trustworthy settings. Recent applications have expanded the scope of the data managed to include unstructured and semi-structured data as well as non-persistent data such as data streams. The queries expected are diverse and imprecise, and include on-line analytical processing and ranking queries. The data itself is often stored with non-trusted third parties, and yet these third parties are expected to manage the data and answer queries. Many of these innovative and challenging settings form the basis of the research projects we are exploring at UCSB. In this talk I will discuss recent results related to the execution of privacy-preserving operations, detecting fraud in advertisement click streams, and using hardware to overcome performance limitations of software solutions.
Bio: Amr El Abbadi received his Ph.D. in Computer Science from Cornell University. In August 1987 he joined the Department of Computer Science at the University of California, Santa Barbara, where he is currently a Professor and Chair of the department. He has been a visiting professor/scientist at the University of Campinas in Brazil, IBM Almaden Research Center, the Swedish Institute of Computer Science in Stockholm, Sweden, and IRISA at the University of Rennes in France. In 2002-03, he was director of the University of California Education Abroad Program in Cairo, Egypt. He has served as area editor for Information Systems: An International Journal and as an editor of Information Processing Letters (IPL). He has served on many database and distributed systems program committees. Currently, he is a board member of the VLDB Endowment. In 2007, he received the UC Santa Barbara Senate Graduate Mentor Award. Prof. El Abbadi's main research interests and accomplishments have been in understanding and developing basic mechanisms for supporting distributed information management systems, including databases, digital libraries, peer-to-peer systems, and spatial databases.

16 June 2008, 10:30 AM

Title: The Demaq System: Declarative Messaging and Queueing
Speaker: Carl-Christian Kanne, University of Mannheim
Abstract: The goal of the Demaq project is to investigate a novel way of thinking about distributed applications that are based on the asynchronous exchange of XML messages. Unlike today's solutions that rely on imperative programming languages and multi-tiered application servers, Demaq uses a declarative language for implementing the application logic as a set of rules. A rule compiler transforms the application specifications into execution plans against the message history, which are evaluated using our optimized runtime engine. This allows us to leverage existing knowledge about declarative query processing for optimizing distributed applications.
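
A minimal sketch of the rule-based style of application logic, assuming hypothetical message fields and rule syntax (Demaq's real language operates on XML messages and compiles rules into execution plans against the message history):

    from collections import deque

    RULES = []

    def rule(condition):
        """Register a rule: when 'condition' holds for a message, the
        decorated action may emit follow-up messages."""
        def register(action):
            RULES.append((condition, action))
            return action
        return register

    @rule(lambda m: m["type"] == "order" and m["amount"] > 1000)
    def needs_approval(m):
        return [{"type": "approval_request", "order_id": m["id"]}]

    @rule(lambda m: m["type"] == "order" and m["amount"] <= 1000)
    def auto_confirm(m):
        return [{"type": "confirmation", "order_id": m["id"]}]

    def run(queue):
        history = []
        while queue:
            msg = queue.popleft()
            history.append(msg)              # messages are appended, never mutated
            for cond, action in RULES:
                if cond(msg):
                    queue.extend(action(msg))
        return history

    print(run(deque([{"type": "order", "id": 1, "amount": 250},
                     {"type": "order", "id": 2, "amount": 5000}])))
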
Bio: Carl-Christian Kanne is an assistant professor of Information Systems at the University of Mannheim (since 2006), currently on leave as the interim head of the database group at the University of Zurich. His research focuses on data management in a variety of domains, including life sciences and support for distributed business processes. He obtained his Master's (Diplom) and PhD degrees from RWTH Aachen University and University of Mannheim, respectively.

23 June 2008, 10:30 AM

Title: OLAP on Sequence Data
Speaker: Eric Lo, Hong Kong Polytechnic University
Abstract:

Many kinds of real-life data exhibit a logical ordering among their data items and are thus sequential in nature. However, traditional online analytical processing (OLAP) systems and techniques were not designed for sequence data and are incapable of supporting sequence data analysis.

In this talk, I will present the concept of Sequence OLAP, or S-OLAP for short. The biggest distinction of S-OLAP from traditional OLAP is that a sequence can be characterized not only by the attribute values of its constituent items, but also by the subsequence/substring patterns it possesses. This talk will cover the concepts of sequence cuboids and sequence data cubes, as well as implementation issues.
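
A tiny Python sketch of the pattern-based grouping that distinguishes sequence cuboids: each "cell" counts the sequences containing a given subsequence pattern, rather than the sequences sharing an attribute value. The session data and pattern names are invented for illustration; the cuboid structure presented in the talk is more general.

    def contains_subsequence(seq, pattern):
        # True if the pattern's symbols appear in seq, in order
        # (not necessarily contiguously).
        it = iter(seq)
        return all(sym in it for sym in pattern)

    sessions = [
        ["home", "search", "product", "buy"],
        ["home", "product", "home"],
        ["search", "product", "buy"],
    ]

    patterns = {"browsers": ["home", "product"],
                "buyers":   ["search", "buy"]}

    # One cell per pattern: how many sequences exhibit that behaviour.
    cube = {name: sum(contains_subsequence(s, p) for s in sessions)
            for name, p in patterns.items()}
    print(cube)   # {'browsers': 2, 'buyers': 2}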

Bio: Eric Lo has been an assistant professor at the Hong Kong Polytechnic University (HKPU) since 2007. He received his Bachelor's degree from HKPU and his Master's degree from the University of Hong Kong (HKU). He received his PhD from ETH Zurich in 2007. His research interests include data engineering, software engineering, and the fusion of both.