Database Seminar Series (2000-2001) | Data Systems Group

The Database Seminar Series provides a forum for presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks that are scheduled for 2000-2001 are below, and more will be listed as we get confirmations. Please send your suggestions to M. Tamer Özsu.

Unless otherwise noted, all talks will be in room DC (Davis Centre) 1304. Coffee will be served 30 minutes before the talk.

We will try to post the presentation notes, whenever that is possible. Please click on the presentation title to access these notes (usually in pdf format).

Database Seminar Series is supported by iAnywhere Solutions, A Sybase Company.

Vincent Oria

Hector Garcia-Molina

Avigdor Gal

Jan Chomicki

Paul Cotton

Bettina Kemme

Peter Fankhauser

Mitch Cherniack

Paul Larson

Sibel Adali

Alberto Mendelzon

Peter Buneman

Klaus Meyer-Wegener

Jarek Gryz

23 October 2000, 11:00 AM; MC (Mathematics and Computer Building) 5158

Title:	Courseware-On-Demand: Generating New Course Material From Existing Courses
Speaker:	Vincent Oria, New Jersey Institute of Technology
Abstract:	Creation of appropriate educational content is a major challenge and cost factor in the increasing important areas of on-line and life-long learning. The aim of the ongoing Courseware-on-Demand project is to enable the creation of course content by recomposing parts extracted from existing annotated course materials. Course materials comprising a variety of learning objects (slides, video, audio, textbook, etc.), are gathered, decomposed into elementary learning fragments, appropriately annotated, indexed, and stored. They are further augmented with course structures and dependencies (prerequisite and precedence). A course designer who has a particular training goal in mind can pose a query to the Courseware-on-Demand system, expressing that goal. The first step in processing the query is to locate course fragments containing material relevant to the goal. Then, taking into account the dependencies, the system can define a minimal path in the metadata network that leads to the goal. The last step is to physically extract the parts of the course materials retrieved during the previous step, and compose the new teaching material. This is joint work with Silvia Hollfelder and Peter Fankhauser, GMD-IPSI, Darmstadt, Germany.
Bio:	Vincent Oria got a Diplôme d'Ingénieur in 1989 from the Institut National Polytechnique, Yamoussoukro, Côte d'Ivoire (Ivory Coast), a D.E.A in 1990 from Université Pierre et Marie Curie (Paris VI), Paris, France, and a Ph.D. in 1994 in Computer Science from the Ecole Nationale Supérieure des Télécommunications (ENST), Paris, France. He worked as a Research Scientist at the ENST Paris from 1994 to 1996 before joining Prof. Tamer Ozsu's group at the University of Alberta, Canada as a Post-Doct from 1996 to 1999 where he was the primarily researcher of the DISIMA project. He is currently an Assistant Professor in the Department of Computer and Information Science at the New Jersey Institute of Technology (NJIT), USA.

6 November 2000, 11:00 AM

Title:	How to Crawl the Web
Speaker:	Hector Garcia-Molina, Stanford University
Abstract:	Joint work with Junghoo Cho A crawler collects large numbers of web pages, to be used for building an index or for data mining. Crawlers consume significant network and computing resources, both at the visited web servers and at the site(s) collecting the pages, and thus it is critical to make them efficient and well behaved. In this talk I will discuss how to build a "good" crawler, addressing questions such as: How can a crawler gather "important" pages only? How can a crawler efficiently maintain its collection "fresh"? How can a crawler be parallelized? I will also summarize results from an experiment conducted on more than half million web pages over 4 months, to estimate how web pages evolve over time.
Bio:	Hector Garcia-Molina is the Leonard Bosack and Sandra Lerner Professor in the Departments of Computer Science and Electrical Engineering at Stanford University, Stanford, California. From August 1994 to December 1997 he was the Director of the Computer Systems Laboratory at Stanford. From 1979 to 1991 he was on the faculty of the Computer Science Department at Princeton University, Princeton, New Jersey. His research interests include distributed computing systems and database systems. He received a BS in electrical engineering from the Instituto Tecnologico de Monterrey, Mexico, in 1974. From Stanford University, Stanford, California, he received in 1975 a MS in electrical engineering and a PhD in computer science in 1979. Garcia-Molina is a Fellow of the ACM, received the 1999 ACM SIGMOD Innovations Award, and is a member of the President's Information Technology Advisory Committee (PITAC).

20 November 2000, 11:00 AM

Title:	An Authorization Model for Temporal Data
Speaker:	Avigdor Gal, Rutgers University
Abstract:	The use of temporal data has become wide-spread in recent years, within applications such as data warehouses and spatiotemporal databases. In this paper, we extend the basic authorization model by facilitating it with the capability to express authorizations based on the temporal attributes associated with data, such as transaction time and valid time. In particular, a subject can specify authorizations based on data validity or data update time, using either absolute or relative time references. Such a specification is essential in providing access control for predictive data, or in constraining access to data based on currency considerations. We provide an expressive language for specifying such access control to temporal data, using a variation of temporal logic for specifying complex temporal constraints. We also introduce an easy-to-use access control mechanism for stream data. A joint work with Vijay Atluri, Rutgers University. A paper describing this material is available here.
Bio:	Avigdor Gal is a faculty member at the Department of MSIS at Rutgers University. He received his D.Sc. degree from the Technion-Israel Institute of Technology in 1995 in the area of temporal active databases. He has published more than 30 papers in journals, books, and conferences on the topics of information systems architectures, active databases and temporal databases. Together with Dr. John Mylopoulos, Avigdor has chaired the Distributed Heterogeneous Information Services workshop at HICSS98 and he was the guest editor of a special issue by the same name in the International Journal of Cooperative Information Systems. Also, he was the General co-Chair of CoopIS2000. Avigdor has consulted in the area of eCommerce and is a member of the ACM and the IEEE computer society..

4 December 2000, 11:00 AM

Title:	Consistent Query Answers in Inconsistent Databases
Speaker:	Jan Chomicki, University at Buffalo
Abstract:	As the amount of information available in online data sources explodes, there is a growing concern about the consistency and quality of answers to user queries. This talk addresses the issue of using logical integrity constraints to gauge the consistency and quality of query answers. Although it is impractical to enforce global integrity constraints across different data sources and correct integrity violations by updating individual sources, integrity constraints are still useful because they express important semantic properties of data. This talk introduces formal notions of database repair and consistent query answer: A consistent answer is true in every minimal repair of the database. The information about answer consistency serves as an important indication of its quality and reliability. I will show a general method for transforming first-order queries in such a way that the transformed query computes consistent answers to the original query. I will characterize the scope of this method: the classes of queries and integrity constraints it handles. For functional dependencies, I will provide a complete characterization of the computational complexity of computing consistent query answers to scalar aggregation queries (those involving the SQL operators COUNT, MIN, MAX, SUM, or AVG). I will also show that if the set of functional dependencies is in Boyce-Codd Normal Form, the boundary between tractable and intractable can be pushed further using the tools of perfect graph theory. (Joint work with Marcelo Arenas and Leopoldo Bertossi, with a contribution from Vijay Raghavan and Jeremy Spinrad.)
Bio:	Jan Chomicki received his M.S. in computer science with honors from Warsaw University, Poland, and his Ph.D., also in computer science, from Rutgers University. He is currently an Associate Professor at the Department of Computer Science and Engineering, University at Buffalo (formerly: SUNY Buffalo). He has held visiting positions at Hewlett-Packard Labs, European Community Research Center, Bellcore, and Lucent Bell Labs. He is the recipient of 3 National Science Foundation research grants, the author of over 40 publications, and an editor of the book "Logics for Databases and Information Systems," Kluwer, 1998.

22 January 2001, 11:00 AM

Title:	W3C XML Query WG: A Status Report (PDF)
Speaker:	Paul Cotton, Microsoft Canada
Abstract:	The World Wide Web Consortium (W3C) XML Query Working Group [1] was chartered in September 1999 to develop a query language for XML [2] documents. The goal of the XML Query Working Group is to produce a formal data model for XML documents with Namespaces [3] based on the XML Infoset [4] and XML Schemas [5, 6, 7], an algebra of query operators on that data model, and then a query language with a concrete canonical syntax based on the proposed operators. The WG produced its first Requirements document [8] in January 2000. Subsequently it has published an XML Query Data Model Working Draft [9] and an XML Query Algebra Working Draft [10]. The XML Query WG is currently working on a syntax for the XQuery language. This talk will provide an update on the current status of the W3C XML Query WG. This update will include current information on the status of the XML Query Requirements, Data Model and Algebra. The talk will also outline the relationship of the work of the XML Query WG to other W3C XML standards especially XML Schema and XPath [11].
Bio:	Paul Cotton joined Microsoft in May, 2000 and is currently Program Manager of XML Standards. Paul telecommutes to his Redmond job from his home in Ottawa, Canada. Paul has 28 years of experience in the Information Technology industry. He has been involved in computer standards work since 1988 when he began representing Fulcrum Technologies in the SQL Access Group where he was heavily involved in the development of the SQL Call-Level Interface (CLI) which is the de jure standard based on ODBC. Paul has represented his employer and Canada on the ISO SQL and SQL/Multi-Media committees since 1992 and was the Editor of the SQL/MM Full-Text and Still Image documents from 1995 until joining Microsoft. Paul was a founding member in the consortium efforts to standardize JDBC and SQLJ which provide interfaces to SQL for the Java language. Paul has been participating in the W3C XML Activity since mid-1998 when he became IBM's prime representative on the XML Linking and Infoset Working Groups. Paul has been chairperson of the XML Query Working Group and a member of the XML Coordination Group since September 1999. Paul was elected to the W3C Advisory Board in June 2000 soon after joining Microsoft. Paul is also Microsoft's alternate on the XML Protocol Working Group. Paul graduated with a M.Math in Computer Science from the University of Waterloo in 1974.

5 February 2001, 11:00 AM; DC 1302

Title:	Consistent and Efficient Database Replication based on Group Communication (PDF)
Speaker:	Bettina Kemme, McGill University
Abstract:	Data replication is a common mechanism in distributed database systems that provides performance and fault-tolerance. Most early work has focused on eager replication: updates are immediately sent to all copies in order to guarantee consistency. However, database designers often point out that eager replication is not feasible in practice due to the high deadlock rate, the number of messages involved and the poor response times. As a result, the focus has shifted towards faster lazy replication where updates are propagated asynchronously to remote copies. The price paid is poor consistency or severe restrictions to the configuration of the system. This talk suggests that, contrary to what has been claimed, eager replication can be implemented in such a manner as to provide not only clear correctness criteria but also reasonable performance and scalability. This is especially true in environments where communication is fast (e.g., computer clusters). I will present a suite of eager replication protocols based on novel techniques that avoid most of the limitations of current approaches. These protocols take advantage of powerful broadcast primitives to serialize transactions, and to avoid deadlocks and an explicit 2-phase commit. The talk will also discuss the practical implications of integrating these techniques into a real database system, and provide experimental results that prove the feasibility of the approach.
Bio:	Bettina Kemme received her M.S. in Computer Science in 1996 from the Friedrich-Alexander University, Erlangen, Germany, and her Ph.D. in Computer Science in 2000 from the Swiss Federal Institute of Technology (ETH), Zurich, Switzerland. She is currently an Assistant Professor at the School of Computer Science at McGill University, Montreal.

28 February 2001, 2:30 PM, DC 1302

Title:	The XML Query Algebra: State and Challenges (PDF)
Speaker:	Peter Fankhauser, GMD-IPSI
Abstract:	The XML Query Algebra [1] is designed as part of the W3C-Activity XML Query [2] to provide a formal basis for an XML Query Language. It is guided by three principles. Full support of wellformed and valid XML, compositionality, and strong typing. This talk will introduce the overall design, the underlying datamodel, the type system, and the operational semantics of the algebra. Furthermore, the talk will exemplify how to map surface syntaxes such as XPath [3] or XQuery to the algebra, and discuss the relationship between the algebra's typesystem and XML Schema's [4,5,6] typesystem.
Bio:	Peter Fankhauser is head of the department OASYS (Open Adaptive Information systems) and director of the XML Competence Center at the GMD Institute for Integrated Publication and Information Systems (IPSI). His main research and development interests lie in information brokering and transaction brokering. He has been the technical leader in the development of tools for unified access to information retrieval systems, document structure recognition, and the integration of heterogeneous databases. He is member of the W3C XML Query working group, and is one of the editors of the W3C XML-Query Requirements and the XML-Query Algebra working drafts He has received a PhD in computer science from the Technical University of Vienna in 1997.

12 March 2001, 11:00 AM

Title:	Profile-Driven Data Management (PDF)
Speaker:	Mitch Cherniack, Brandeis University
Abstract:	Large amounts of data populate the Internet, yet it is largely unmanaged. Access to this data requires a wide array of network resources including server capacity, bandwidth, and cache space. Applications compete for these resources with little overarching support for its intelligent allocation. The database community has seen this problem before. Data management techniques (e.g., indexing, data organization, buffer management) have long been used in DBMS's to offset the demands of limited resources made by competing applications' requests for data. Such techniques are application-driven; a database administrator (DBA) consults with users to determine their data needs, and tweaks the knobs of data management tools accordingly. But the Internet environment, consisting of autonomous data sources and global in scale, cannot be tamed by a DBA. We take the view that in this environment, the DBA role must be automated, and the specification of application-level data requirements must be made formal by way of processible user profiles. In this talk, I will discuss the Profile-Driven Data Management project: a multi-institution endeavor funded by the National Science Foundation in October, 2001. I will demonstrate the goals and scope of the project, present some early results, and pose some of the many challenges that lie ahead. Joint Work with: Stan Zdonik (Brown University) and Michael J. Franklin (UC Berkeley)
Bio:	Mitch Cherniack received his Ph.D. in Computer Science in 1999 from Brown University. He is currently an Assistant Professor at Brandeis University, Boston.

19 March 2001, 11:00 AM; DC1302

Title:	Optimizing Queries Using Materialized Views (PDF) (Associated paper (PDF))
Speaker:	Per-Ake (Paul) Larson, Microsoft Research
Abstract:	Materialized views can provide orders of magnitude improvements in query processing time, especially for aggregation queries over large tables. To realize this potential, the query optimizer must understand how and when to exploit materialized views. This talk describes a fast and scalable algorithm for determining whether part or all of a query can be computed from materialized views and outlines how the algorithm is integrated into a transformation-based optimizer. The algorithm handles views composed of selections, joins and a final group-by. Multiple rewrites using views may be generated and the optimizer chooses the best alternative in the normal way. Experimental results based on a prototype implementation in Microsoft SQL Server show outstanding performance and scalability. Optimization time increases slowly with the number of views but remains low even up to a thousand views.
Bio:	Paul Larson is a senior researcher in the Database Group at Microsoft Research. Prior to joining Microsoft, he was a Professor of Computer Science at the University of Waterloo.

26 March 2001, 11:00 AM

Title:	A Configurable Application View Storage System (PDF)
Speaker:	Sibel Adali, RPI
Abstract:	Storage management is a well-known method for improving the efficiency of data intensive and networked applications. Today's data management systems handle many non-traditional data formats, ranging from spatial data to images, video and other hybrid representations. This requires the use of specialized methods to query, extract and transform data from multiple, possibly distributed sources. There is a great need to develop efficient and scalable methodologies for storing and reusing the results of computations in such applications. In this paper, we introduce a configurable storage management system that allows programmers to specify a collection of storage management protocols for managing different types of data requests on top of a shared and possibly distributed pool of resources. In addition, dynamic protocol change rules allow the system to shift from one storage management method to another depending on the availability of system resources. Furthermore, the storage management system can be expanded with application specific methods to look-up and re-use stored items. We show how the storage system can be tuned to specific workload specifications with the help of a simulation model that takes into account both the cost of storage management protocols as well as the methods to look-up and re-use stored items.
Bio:	Sibel Adali is an Assistant Professor in Rensselaer Polytechnic Institute, which she joined in 1996 after her PhD from University of Maryland. At RPI, she leads the Multimedia Information Integration Lab. Her research focuses on heterogeneous distributed information systems, database interoperability, query optimization, and multimedia information systems.

23 April 2001, 11:00 AM; DC 1302

Title:	Data (and Links) on the Web (PDF)
Speaker:	Alberto Mendelzon, University of Toronto
Abstract:	In the excitement of extending database technology such as semistructured models and languages to the Web, the importance of links between documents has somehow been left behind. We will present two projects that keep links front and center: the WebSQL/WebOQL query languages and the TOPIC system for computing page reputations. WebSQL and WebOQL are query languages intended to serve as high-level tools for building Web-based information systems in the same way that SQL is used for building traditional information systems. TOPIC is a method for analyzing incoming links to a page in order to determine what are the topics that this page is best known for on the Web.
Bio:	Alberto Mendelzon was born in Buenos Aires and received his graduate degrees from Princeton University. His research interests are in databases and knowledge-bases including database design theory, query languages, semi-structured data, global information systems, and OLAP. Alberto has chaired or co-chaired the Program Committees of ACM PODS (1991), VLDB (1992), DOOD (1995), WebDb (1998), DBPL (1999), and WWW8 (1999). He is an Associate Editor of the ACM Transactions on Database Systems and has guest-edited issues of the VLDB Journal, Theory and Practice of Object Systems, and Journal of Computer and System Sciences. He is a member of the Executive Board ACM PODS and Information Director for ACM SIGMOD.

14 May 2001, 11:00 AM

Title:	Data Provenance (PDF)
Speaker:	Peter Buneman, University of Pennsylvania
Abstract:	When you find some data on the Web, are you concerned about out how it got there? Most of us have learned not to put much faith in what we find when we casually browse the Web. But if you are a scientist or any kind of scholar, you would like to have confidence in the accuracy and timeliness of the data that you find. In particular, you would like to know how it got there -- its provenance. In Bioinformatics there are literally hundreds of public databases. Most of these are not source data: their contents have been extracted from other databases by a process of filtering, transforming, and manual correction and annotation. Thus, describing the provenance of some piece of data is a complex issue. These "curated" databases have enormous added value, yet they often fail to keep an adequate description of the provenance of the data they contain. In this talk I shall describe some recent work on a Digital Libraries project to investigate data provenance. I shall try to describe the general problem and then deal with some technical issues that have arisen, including (a) a semistructured data model that may help in characterizing data provenance (b) tracing provenance through database queries (c) efficient archiving of scientific databases and (d) keys (canonical identifiers) for semistructured data and XML. Joint work with Sanjeev Khanna and Wang-Chiew Tan
Bio:	Peter Buneman is Professor of Computer and Information Science at the University of Pennsylvania. His work in computer science has focussed mainly on databases and programming languages, areas in which he has worked on active databases, database semantics, type systems, approximate information, query languages, and on semistructured data and data formats, an area in which he has recently co-authored a book. He has served on numerous program committees, editorial boards and working groups, and has been program chair for ACM Sigmod, ACM PODS and the International Conference on Database Theory. He was recently elected fellow of the ACM.

28 May 2001, 11:00 AM

Title:	Combining Data Independence and Realtime in Media Servers (PDF)
Speaker:	Klaus Meyer-Wegener, Technical University of Dresden
Abstract:	In order to be flexible and useful, media servers should provide data independence and realtime support. Extensible database systems today offer just the former, media servers just the latter. Going for data independence makes realtime support even harder, since it requires additional mappings and conversions: To play a video in a format different from the one used for storage, realtime conversion is needed. The presentation will introduce the implementation of data independence using so-called converter graphs. The timing of each step in the graph can be described as a jitter-constrained periodic stream, which allows to calculate some characteristics (e.g. buffer size) and supports scheduling. Probabilities of conflict can be used to determine the need for simultaneous playout of the same media stream. Finally, statistical rate-monotonic scheduling would allow to introduce a notion of quality, but an integrated model remains to be developed. This gives an overview of ongoing work in a project called "Memo.real." It is part of a Collaborative Research Center funded by the Deutsche Forschungsgemeinschaft. This implementation builds on the KANGAROO prototype and the Dresden Realtime Operating System (DROPS)
Bio:	Prof. Klaus Meyer-Wegener is a Professor at Dresden University of Technology, the Head of the Database Group. His research interests include multimedia databases, document and workflow management, and digital Libraries.

11 June 2001, 11:00 AM

Title:	Query Optimization via Data Mining
Speaker:	Jarek Gryz, York University
Abstract:	It is almost a trivial observation that the more we know about the data the faster we can access it. This idea was first explored in the early 1980's in semantic query optimization (SQO), a technique that uses semantic information stored in databases as integrity constraints to improve query performance. Some of the ideas developed for SQO have been used commercially, but to the best of our knowledge, no extensive, commercial implementations of SQO exist today. But SQO can be extended beyond the use of just integrity constraints. We can use data mining to discover patterns in data that can be useful for optimization in the same way integrity constraints are. Indeed, by focusing on the most useful patterns we can provide better query performance improvements than it is possible with the traditional SQO. In the first part of this talk, I present an implementation of two SQO techniques, Predicate Introduction and Join Elimination, in DB2 Universal Database. I present the implemented algorithms and performance results using the TPCD and APB-1 OLAP benchmarks. In the second part of the talk I will describe my current research on developing data mining techniques for discovery of check constraints and empty joins and the use of this information for query optimization.
Bio:	Jarek Gryz is an Assistant Professor at York University. His research interests include query Optimization via data mining, semantic query caching, and query optimization in parallel database systems.