The Database Seminar Series provides a forum for presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks that are scheduled for 2004-2005 are below, and more will be listed as we get confirmations. Please send your suggestions to M. Tamer Özsu.
Unless otherwise noted, all talks will be in room DC (Davis Centre) 1304. Coffee will be served 30 minutes before the talk.
We will try to post the presentation notes, whenever that is possible. Please click on the presentation title to access these notes (usually in pdf format).
Database Seminar Series is supported by iAnywhere Solutions, A Sybase Company.
|Minos N. Garofalakis|
|Jeffrey F. Naughton|
|Title:||DBCache: A Project on Database Caching Support for Web Applications (PDF)|
|Speaker:||C. Mohan, IBM Almaden Research Lab.|
|Abstract:||In this talk, I will discuss IBM Almaden's DBCache project whose goal is to support database caching for use with web-based applications. I will introduce a new database object called Cache Table that enables persistent caching of the full or partial content of a remote database table. The content of a cache table is either defined declaratively and populated in advance at setup time, or determined dynamically and populated on demand at query execution time. Dynamic cache tables exploit the characteristics of typical transactional web applications with a high volume of short transactions, simple equality predicates, and 3-4 way joins. Based on federated query processing capabilities, we developed a set of new technologies for database caching: cache tables, "Janus" (two-headed) query execution plans, cache constraints, and asynchronous cache population methods. Our solution supports transparent caching both at the edge of content delivery networks and in the middle-tier of an enterprise application infrastructure, improving the response time, throughput and scalability of transactional web applications. This is joint work with M. Altinel, C. Bornhoevd, S. Krishnamurthy, H. Pirahesh and B. Reinwald. The paper on which this talk is based was published in the 29th International Conference on Very Large Databases (VLDB).|
|Bio:||Please go here for the bio.|
4 October 2004, 4:30 PM, DC 1350 (This is SCS Distinguished Lecture; please note the special time and place)
|Title:||On-Line Science: The World-Wide Telescope as a Prototype for the New Computational Science|
|Speaker:||Jim Gray, Microsoft Bay Area Research Center|
|Abstract:||Computational science has historically meant simulation; but, there is an increasing role for analysis and mining of online scientific data. As a case in point, half of the world's astronomy data is public. The astronomy community is putting all that data on the Internet so that the Internet becomes the world's best telescope: it has the whole sky, in many bands, and in detail as good as the best 2-year-old telescopes. It is useable by all astronomers everywhere. This is the vision of the Virtual Observatory -- also called the World Wide Telescope (WWT). As one step along that path I have been working with the Sloan Digital Sky Survey and CalTech to federate their data in web services on the Internet, and to make it easy to ask questions of the database (see http://skyserver.sdss.org and http://skyquery.net). This talk explains the rationale for the WWT, discusses how we designed the database, and talks about some data mining tasks. It also describes computer science challenges of publishing, federating, and mining scientific data, and argues that XML web services are key to federating diverse data sources.|
|Bio:||Jim Gray is part of Microsoft's research group. His work focuses on databases and transaction processing. Jim is active in the research community, is an ACM, NAE, NAS, and AAAS Fellow, and received the ACM Turing Award for his work on transaction processing. He edits of a series of books on data management, and has been active in building online databases like http://TerraService.net and http://SkyServer.sdss.org.|
|Title:||Database replication for commodity database services (PDF)|
|Speaker:||Gustavo Alonso, ETH Zürich|
|Abstract:||Database replication is a complex topic that has attracted many researchers but where there are comparatively few working systems. In this talk I will review several projects in our group that aim at implementing feasible replication solutions that combine performance and consistency. The talk will review database replication as a first step in identifying the problems that need to be addressed. This introduction will be followed by a review of commercial systems and related work to describe the current state of the art. Then the talk will present Ganymed, our most recent replication platform. Ganymed uses a lazy replication strategy that nevertheless appears to be eager to the clients using the replicated database. Unlike other middleware based approaches, Ganymed has a very small footprint and imposes no restrictions on how transactions are submitted to the system, how the data is partitioned or the type of queries involved. Ganymed is being used to scale out commercial engines (Oracle, DB2) by replicating the contents using open source databases (PostgreSQL). It is also the basis for a number of ongoing projects on autonomic scale out and on providing database services as commodity services. In the talk I will also discuss these projects and how the approach taken in Ganymed is a fundamental
stepping stone to achieve these goals.
|Bio:||Gustavo Alonso is professor in the Department of Computer Science at the Swiss Federal Institute of Technology in Zurich (ETHZ). Gustavo Alonso is from Madrid, Spain, where he completed in 1989 his undergraduate studies in Telecommunications Engineering at the Madrid Technical University (UPM-ETSIT). As a Fulbright student, he did his graduate studies in computer science (M.S. 1992, Ph.D. 1994) in the University of California at Santa Barbara. After graduating, he was a visiting scientist in the IBM Almaden Research Laboratory in San Jose, California, where he worked within the Exotica project in areas such as workflow management and transaction processing. In September 1995 he joined ETH where he has since then lead several projects in databases, workflow management, replication, and advanced applications. Currently, Gustavo Alonso leads the Information and Communication Systems Research Group. The research interests of the group include Web Services, grid and cluster computing, databases, workflow management, scientific applications of database and workflow technology (for geographic, astronomical, and biochemical data), pervasive computing and dynamic aspect oriented programming. Gustavo Alonso is co-author of a recently published book on Web Services (Springer Verlag, Berlin 2004, ISBN 3-540-44008-9) and has participated in numerous conferences, panels and projects related to the topic. He also regularly works as an independent consultant in enterprise application integration, Web Services, and middleware projects.|
|Title:||A Call to Regularity (PDF)|
|Speaker:||Moshe Vardi, Rice University|
From the mid 1980s to the mid 1990s, a major theme in database theory was the study of Datalog, the language of logic programs without function symbols. Unfortunately, most decision problems involving Datalog turned out to be undecidable. Furthermore, well-behaved fragments of Datalog turned out to be rather restrictive and unnatural. Surprisingly, a natural and quite general fragment of Datalog did emerge in the mid 1990s, in the context of semistructured data. This fragment is the class of regular queries, whose basic element is that of regular path query. In his talk I will describe the class of regular queries and its well-behavedness in the context of view-based query processing.
|Bio:||Moshe Y. Vardi is the George Professor in Computational Engineering and Director of the Computer and Information Technology Institute at Rice University. He chaired the Computer Science Department at Rice University from January 1994 till June 2002. Prior to joining Rice in 1993, he was at the IBM Almaden Research Center, where he managed the Mathematics and Related Computer Science Department. His research interests include database systems, computational-complexity theory, multi-agent systems, and design specification and verification. Vardi received his Ph.D. from the Hebrew University of Jerusalem in 1981. He is the author and co-author of over 250 technical papers, as well as a book titled "Reasoning about Knowledge". Vardi is the recipient of three IBM Outstanding Innovation Awards and a co-winner of the 2000 Goedel Prize. He holds honorary doctorates from the University of Saarland, Germany, and the University of Orleans, France. Vardi is an editor of several international journals and the president of the International Federation of Computational Logicians. He is a Fellow of the Association of Computing Machinery, the American Association for the Advancement of Science, and the American Association for Artificial Intelligence. He was designated Highly Cited Researcher by the Institute for Scientific Information, and was elected as a member of the US National Academy of Engineering and the European Academy of Sciences.|
|Title:||Privacy-Preserving Piracy Prevention|
|Speaker:||Dennis Shasha, New York University|
|Abstract:||Software, whether computer programs, films, books, or DVDs, is currently easy to steal. It can be copied, modified, and run in spite of the best protections available. The net result is that software authors and performing artists are often not paid by the consumers of their work. Efforts to combat this theft have been based on technologies that don't work and that entail the threat of criminal penalties and gross invasions of privacy.
The Shield system is a flexible framework that can prevent most piracy, preserve privacy, and allow even more flexibility than current systems provide (e.g., no regional zones for DVDs). The framework requires that the operating system and hardware protect the piracy detection software which performs episodic communications with a security infrastructure. The protocols are designed to prevent anyone from learning which software a user device has, even when the user device attempts to use pirated software.
Joint work with Michael Rabin
|Bio:||Dennis Shasha is a professor of computer science at the Courant Institute of NYU where he works with biologists on pattern discovery for microarrays, combinatorial design, and network inference; and with physicists and financial people on algorithms for time series. Other areas of interest include database tuning, tree and graph matching, and cryptographic file systems. In his spare time, he has written three books of puzzles, a biography of great computer scientists, and technical books about database tuning, biological pattern recognition and a book on time series. He also writes the puzzle column for Scientific American and Dr. Dobb's Journal.
After graduating from Yale in 1977, he worked for IBM designing circuits and microcode for the 3090. He completed his Ph.D. at Harvard in 1984 (thesis advisor: Nat Goodman). Since he enjoys typing, he has written a few books: a professional reference book Database Tuning: principles, experiments, and troubleshooting techniques (2002, Morgan Kaufmann), three books about a mathematical detective named Dr. Ecco entitled The Puzzling Adventures of Dr. Ecco (1988, Freeman, and republished in 1998 by Dover), Codes, Puzzles, and Conspiracy (1992, Freeman), Dr. Ecco's Cyberpuzzles (2002, W. W. Norton), a book of biographies about great computer scientists called Out of Their Minds: the lives and discoveries of 15 great computer scientists (1995, Copernicus/Springer-Verlag), and Pattern Discovery in Biomolecular Data: Tools, Techniques, and Applications published in 1999 by Oxford University Press. In addition, he has co-authored thirty journal papers, 40 conference papers, and six patents. He writes a monthly puzzle column for Scientific American and Dr. Dobb's Journal.
|Title:||Purpose-Based Database Access Control for Privacy Protection|
|Speaker:||Elisa Bertino, Cerias, Purdue University|
This is joint work with Ninghui Li and Ji-Won Byun (CERIAS, Purdue University)
|Bio:||Elisa Bertino is professor of Computer Science and of Electrical and Computer Engineering (joint appointmennt) at Purdue University and serves as Research Director of CERIAS. She is also a faculty member at Department of Computer Science and Communication of the University of Milan where she is the director of the DB&SEC laboratory. She has been a visiting researcher at the IBM Research Laboratory (now Almaden) in San Jose, at the Microelectronics and Computer Technology Corporation, at Rutgers University, at Telcordia Technologies.
Her main research interests include security, privacy, database systems, object-oriented technology, multimedia systems. In those areas, Prof. Bertino has published more than 250 papers in all major refereed journals, and in proceedings of international conferences and symposia. Her research has been funded by severa; entities and organizations both in USA and Europe, including NSF, the European Union under the 5th and 6th Research Program Framework, IBM, Telcordia, Microsoft, the Italian Telecom. She is a co-author of the books "Object-Oriented Database Systems - Concepts and Architectures" 1993 (Addison-Wesley International Publ.), "Indexing Techniques for Advanced Database Systems" 1997 (Kluwer Academic Publishers), and "Intelligent Database Systems" 2001 (Addison-Wesley International Publ.). She is a co-editor in chief of the Very Large Database Systems (VLDB) Journal and a member of the advisory board of the IEEE Transactions on Knowledge and Data Engineering. She serves also on the editorial boards of several scientific journals, incuding IEEE Internet Computing, ACM Transactions on Information and System Security, IEEE Transactions on Secure and Dependable Computing, Journal, the Journal of Computer Security, Data & Knowledge Engineering, the International Journal of Cooperative Information Systems, Science of Computer Programming. She has been consultant to several Italian companies on data management systems and applications and has given several courses to industries. She is involved in several projects sponsored by the EU. She has served as Program Committee members of several international conferences, such as ACM SIGMOD, VLDB, ACM OOPSLA, as Program Co-Chair of the 1998 IEEE International Conference on Data Engineering (ICDE), as program chair of 2000 European Conference on Object-Oriented Programming (ECOOP 2000), as program chair of the 7th ACM Symposium of Access Control Models and Technologies (SACMAT 2002), and as program chair of the 2004 Extending Database Technology (EDBT 2004) Conference.
Elisa Bertino is a Fellow member of IEEE and a Fellow member of ACM and has been been named a Golden Core Member for her service to the IEEE Computer Society. She received the 2002 IEEE Computer Society Technical Achievement Award for "For outstanding contributions to database systems and database security and advanced data management systems".
|Title:||Sketching Massive Distributed Data Streams|
|Speaker:||Minos N. Garofalakis, Bell Labs|
|Abstract:||There is growing interest in algorithms for querying and analyzing massive, continuous data streams (that is, data that is seen only once and in a fixed order) with limited memory and CPU-time resources. Such streams arise naturally in emerging large-scale event monitoring applications; for instance, network-operations monitoring in large ISPs, where usage information from numerous sites needs to be continuously collected and analyzed for interesting trends. In addition to memory- and time-efficiency concerns, the inherently distributed nature of such applications also raises important communication-efficiency issues, making it critical to carefully optimize the use of the underlying network infrastructure.
In this talk, I will give an overview of some key small-space pseudo-random sketching techniques for effective query processing over streaming data. The focus will be on some of our own efforts in the area, including our very recent work on communication-efficient continuous query tracking over distributed data streams.
|Bio:||Minos Garofalakis is a Member of Technical Staff in the Internet Management Research Department of Bell Labs, Lucent Technologies. He received his PhD in Computer Science from the University of Wisconsin-Madison in 1998. His current research interests lie in the areas of data streaming, data reduction and approximate query processing, network management, and XML databases. Minos has served on a number of program committees for international database and data-mining conferences, including ACM SIGMOD, ACM PODS, VLDB, and ACM SIGKDD; he also currently serves as an Associate Editor for the IEEE Data Engineering Bulletin.|
|Title:||Progress Indicators and Future Directions|
|Speaker:||Jeffrey F. Naughton, University of Wisconsin|
|Abstract:||This talk will consist of two largely disjoint portions - the first will be a discussion of some recent work we have done on providing progress indicators for SQL queries. Many applications provide progress indicators (think of file downloads and the like), but providing similar functionality for SQL queries is perhaps surprisingly challenging and interesting, with close ties to query optimization. In the second part of the talk, time permitting, I will speculate on future directions for DBMS research.|
|Bio:||Jeffrey F. Naughton earned a bachelor's degree in Mathematics from the University of Wisconsin-Madison, and a PhD in Computer Science from Stanford University. He served as a faculty member in the Computer Science Department of Princeton University before moving to the University of Wisconsin-Madison, where he is currently Professor of Computer Science. He has published over 100 technical papers, received the National Science Foundation's Presidential Young Investigator award in 1991, and was named an ACM Fellow in 2002.|
|Title:||Distributed Versioning Techniques for Scaling the Back-end Database Tier in Dynamic Content Server Clusters|
|Speaker:||Cristiana Amza, University of Toronto|
|Abstract:||In this presentation I will introduce and compare two novel database replication techniques based on distributed versioning for dynamic scaling and availability in the database tier of dynamic content web servers. Our techniques provide both throughput scaling and strong data consistency (1-copy serializability) for the database tier. It has generally been believed that this combination is hard to achieve through replication because of the growth of the number of conflicts. As a result, state-of-the-art dynamic content web servers rely on a single very expensive database supercomputer to satisfy the volume of requests.
We exploit the different distributed data versions that naturally come about as a result of asynchronous data replication in order to increase concurrency by running conflicting transactions in parallel on different replicas. The two distributed versioning variations I will present have different concurrency versus transparency trade-offs. The first replication scheme avoids making any changes to the database or web server to support replication by implementing concurrency control and conflict avoidance in an intermediate scheduler tier. In the second scheme, the role of the scheduler tier is minimal, while the database tier is modified to closely integrate replication with fine-grained concurrency control.
Our evaluation of distributed versioning replication uses the TPC-W e-commerce benchmark. For small clusters of up to eight database replicas, we perform measurements of a web site implementing the TPC-W specification. We use simulation to extend our measurement results to larger clusters. We show close to ideal scaling for the most common e-commerce workload, the shopping mix, up to 8 database replicas in the experiments and good scaling to up to 60 database replicas in simulation.
|Bio:||Cristiana Amza obtained her Ph.D. in Computer Science from Rice University in May 2003, her M.S. in Computer Science also from Rice University in 1997 and her B.S. in Computer Engineering from Bucharest Polytechnic Institute in 1991. Her Ph.D. thesis work focused on scaling dynamic content web servers using database replication on a commodity cluster. Dr. Amza joined the University of Toronto in October 2003 as an Assistant Professor in the Department of Electrical and Computer Engineering, where she is continuing her research on dynamic content servers with a new emphasis on automatic reconfiguration for performance tuning and failure transparency. Due to the potential commercial applications of her research, she is currently collaborating with the IBM Toronto Lab Tivoli and DB2 High Availability teams through a recently initiated IBM CAS program.|
|Title:||Efficient Top-k Queries for XML Information Retrieval|
|Speaker:||Gerhard Weikum, Max-Planck-Institut für Informatik|
|Abstract:||Non-schematic XML data that comes from many different sources and inevitably exhibits heterogeneous structures and annotations (i.e., XML tags) cannot be adequately searched using database query languages like XPath or XQuery. Often, queries either return too many or too few results. Rather the ranked-retrieval paradigm is called for, with relaxable search conditions, various forms of similarity predicates on tags and contents, and quantitative relevance scoring.
The method of choice for top-k similarity queries is the TA family of threshold algorithms. Large disk-resident index lists for content terms and tag names strongly suggest using the TA variant with sorted access only for disk I/O efficiency. This method maintains score intervals for top-k candidates and performs a conservative threshold test based on upper bounds of candidate scores. The talk will present the TA-prob method which applies a probabilistic threshold test for fast approximation of top-k results. It is based on score predictors that compute convolutions of score distributions in the yet to be scanned tails of index lists. This methods provides an order of magnitude improvement in run-time at high and probabilistically controllable levels of precision and recall relative to the TA baseline.
TA-prob has been further enhanced to cope with ontology-based query expansion; by using a new incremental merge technique over index lists for expansion terms, it is much more efficient than the traditional techniques that expand queries before execution starts. TA-prob is also the major building block in the TopX search engine for XML data. TopX supports path conditions for all XPath axes as exact constraints and content term conditions as well as ontology-based relaxation of terms and tag names as similarity conditions for ranked retrieval.
|Bio:||Gerhard Weikum is a Research Director at the Max-Planck Institute of Computer Science in Saarbruecken, Germany, where he is leading the research group on databases and information systems. He is also the spokesperson of the International Max-Planck Research School (IMPRS) for Computer Science. He received his diploma and doctoral degrees from the University of Darmstadt, Germany, in 1982 and 1986.
Dr. Weikum's recent working areas include implementation, optimization, and self-organization aspects of distributed information systems such as peer-to-peer systems, dependable workflow management, and intelligent search and organization of semi-structured data on the Web and in intranets. Earlier he held positions at the University of the Saarland in Saarbruecken, Germany, ETH Zurich, Switzerland, with MCC in Austin, Texas, and he was a visiting senior researcher at Microsoft Research in Redmond, Washington.
Dr. Weikum has published more than 100 refereed articles in conferences and journals. He has written a textbook on Transactional Information Systems, jointly authored with Gottfried Vossen and published with Morgan Kaufmann. A paper on self-tuning load control for transaction processing systems that appeared in the 1992 VLDB conference has been awarded with the 2002 VLDB 10-year award for the most influential paper of the conference ten years ago.
Dr. Weikum has served on the editorial boards of various journals including ACM TODS, the premier scientific journal of the database research community, and IEEE-CS TKDE. He is president of the board of trustees of the VLDB Endowment, the steering committee of the International Conference on Very Large Data Bases, one of the two leading international conferences on database and information systems research. He has served on committees of numerous conferences in various roles, including the General co-chair of the 2nd Semiannual Conference on Innovative Data Systems Research (CIDR) in 2005, the PC chair of the ACM SIGMOD International Conference on Management of Data in 2004, the PC co-chair of the IEEE-CS International Conference on Data Engineering in 2000, and the organization of several Dagstuhl Seminars. Dr. Weikum was an invited tutorial speaker at the 5th International Conference on Database Theory (ICDT) in 1995, and an invited keynote speaker at the 6th International Conference on Extending Database Technology (EDBT) in 1998 and the 23rd International Conference on Conceptual Modeling in 2004, as well as the French, German, Indian, and Italian Database Conferences.
|Title:||Top-N Query Processing in an Unstructured P2P System|
|Speaker:||Patrick Valduriez, INRIA-Rennes and Lina, University of Nantes|
|Abstract:||Top-n queries are attractive for users in P2P systems with very large numbers of peers but difficult to support efficiently. Solutions that rely on centralized statistics no longer apply in the general case. In this talk, we describe a fully distributed algorithm for executing Top-n queries in an unstructured P2P system. This work is done in the context of APPA (Atlas P2P Architecture), a P2P data management system we are building.
We present our algorithm for the case of unstructured systems, thus with minimal assumptions, so it could also apply to other P2P systems as well (DHT, super-peer). Our algorithm requires no global information, does not depend on the existence of certain peers and addresses the volatility of peers during query execution. We validated our algorithm through implementation over a 64-node cluster and simulation using the BRITE topology generator and SimJava. Our performance evaluation shows that our algorithm has logarithmic scale up and improves Top-n query response time very well using P2P parallelism.
Joint work with Reza Akbarinia and Esther Pacitti
|Bio:||Patrick Valduriez is Director of Research at INRIA, the national research center for computer science in France and scientific director of the Atlas research group (joint team between INRIA and LINA, University of Nantes) pursuing research in data management in distributed systems. He received his Ph. D. degree and Doctorat d'Etat in Computer Science from the University Pierre et Marie Curie (Paris 6) in 1981 and 1985, respectively.
Between 1999-2002, he was a professor of computer science at University Paris 6, on leave from INRIA. From 1989 to 1999, he led the Rodin group working on distributed database and multimedia technology. From 1995 to 2000, he also led Dyade, the R&D joint venture between Bull and INRIA to foster technology transfer in the areas of Internet (50 people). In Dyade, he contributed to the design and the transfer of the Disco (Distributed Information Search Component) prototype to the KelKoo startup. From 1985 to 1989, he was a senior scientist at Microelectronics and Computer Technology Corp. (MCC) in Austin, Texas. There he participated in the design and implementation of the Bubba parallel database system.
|Title:||Preventive Replication in a Database Cluster|
|Speaker:||Esther Pacitti, Université de Nantes|
|Abstract:||In a database cluster, preventive replication can provide strong consistency without the limitations of synchronous replication. However, the original proposal  assumes full replication and has performance limitations. In this paper, we address these two limitations in order to scale up to large cluster configurations. Our first contribution is a refreshment algorithm that reduces the delay introduced by the algorithm and prevents inconsistencies for partially replicated databases. Our second contribution is an optimization that improves transaction throughput. We describe the implementation of our algorithm in our RepDB* prototype over a cluster of 64 nodes running PostgreSQL. Our experimental results using the TPC-C Benchmark show that our algorithm has excellent scale up and speed up.|
|Bio:||Esther Pacitti is associate professor at the University of Nantes. She obtained her Ph.D. degree in 1998 from PUC-Rio, Brazil. Her current research interests include data management in distributed systems, in particular, replication and query processing in cluster and P2P systems.|