# Database Seminar Series (2001-2002)

The Database Seminar Series provides a forum for presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks that are scheduled for 2001-2002 are below, and more will be listed as we get confirmations. Please send your suggestions to M. Tamer Özsu.

Unless otherwise noted, all talks will be in room DC (Davis Centre) 1304. Coffee will be served 30 minutes before the talk.

We will try to post the presentation notes, whenever that is possible. Please click on the presentation title to access these notes (usually in pdf format).

Database Seminar Series is supported by iAnywhere Solutions, A Sybase Company.

 Anthony J. Bonner Qiang Zhu Sylvia Osborn Jose Blakeley Paul Cotton Pat Martin H.V. Jagadish Daniel Barbara Philip Bernstein Ouri Wolfson Derick Wood Raymond Ng

## 24 September 2001, 11:00 AM

 Title: Large-Scale Physical Genome Mapping (PDF) Speaker: Anthony J. Bonner, University of Toronto Abstract: A major step in understanding the genome of an organism is constructing a physical map, that is, an assignment of DNA fragments to their locations on the genome. Building complete maps of large genomes often requires integrating many kinds of experimental data, each with its own forms of noise, experimental error and anomalies, and with subtle relationships between the different forms of data. Unfortunately, the quantity and complexity of the data greatly complicate the map-assembly process, limiting the effectiveness and flexibility of many map-assembly algorithms. This talk outlines the biology of physical genome mapping, the computational problems involved, and our approach to solving them. This approach is based on abstracting the experimental data as a graph. Intuitively, each node in the graph represents a point on the genome, and each edge represents evidence that two points are close together on a chromosome. Many forms of physical mapping data can be abstracted in this way. The result is that genome map assembly becomes largely a problem of graph manipulation. Graph algorithms and graph visualization play key roles in our approach to the map-assembly problem. Bio: Dr. Bonner is an Associate Professor of Computer Science at the University of Toronto, where he has been a faculty member since 1991. He received his M.S. and Ph.D. in Computer Science from Rutgers University in 1990 and 1991, respectively, and his B.Sc. in Mathematics and Physics from the University of Toronto in 1977. He has held visiting positions at the University of Pennsylvania and at INRIA-Rocquencourt. From 1977 to 1983, he worked in signal processing and underwater acoustics at the Defence Research Establishment Atlantic, in Nova Scotia, Canada. He is the author of over 50 publications and presentations.

## 15 October 2001, 11:00 AM

 Title: Local Cost Estimation for Global Query Optimization in a Multidatabase System (PDF) Speaker: Qiang Zhu, The University of Michigan - Dearborn Abstract: A multidatabase system (MDBS) integrates data from pre-existing autonomous local databases managed by heterogeneous database management systems, such as ORACLE, DB2 and ObjectStore, in a distributed environment. A user can issue a global query on the MDBS to retrieve data from multiple local databases. The user does not need to know where the data is stored and how the result is obtained. How to process such a global query efficiently is the task of global query optimization. There are a number of challenges for performing global query optimization in an MDBS. Among them, a crucial one is that some local optimization information, such as local cost models, may not be available to the global query optimizer because of local autonomy. To perform global query optimization, methods for estimating necessary local cost parameters of an autonomous local database system at the global level are required. Research on exploring such methods has attracted a number of researchers in the database area. In this talk, we will present several our techniques to tackle this challenge, including (1) the query sampling method, which drives local cost models for an MDBS based on observed costs of sample queries; (2) the qualitative approach, which extends the query sampling method for a dynamic multidatabase environment by incorporating a qualitative variable into a cost model; (3) the fractional analysis, which estimates a query cost by analyzing its fractions for a gradually-changing multidatabase environment; and (4) the probabilistic approach, which estimates query costs based on Markov chain theory for a rapidly-changing multidatabase environment. Our experimental results demonstrate that this set of techniques is quite promising in solving the local query cost estimation problem for various multidatabase environments. Bio: Dr. Qiang Zhu is an Associate Professor at The University of Michigan - Dearborn. He received his Ph.D. in Computer Science from the University of Waterloo in 1995. He also holds an M.Sc. degree from the McMaster University in Canada and an M.Eng. degree from the Southeast University in China. Dr. Zhu is a principal investigator for a number of database research projects funded by sources including NSF and IBM at The University of Michigan. He has over 40 research publications in various journals and conference proceedings. Dr. Zhu has served as a program committee member and session/workshop chair for a number of international conferences. His current research interests include query optimization for advanced database systems, multidatabase systems, Web-based database technology, and data mining.

## 12 November 2001, 11:00 AM

 Title: Database Applications of Role-Based Access Control (PDF) Speaker: Sylvia Osborn, University of Western Ontario Abstract: Role-Based Access Control (RBAC) has been receiving a lot of attention in the last ten years as an alternative to more traditional forms of access control, namely discretionary and mandatory access control. We will introduce RBAC models, and show how they are and could be used for relational databases. As well, we will show some algorithms for role graphs, and describe how they can be used to integrate security information when two databases or systems are being integrated. Bio: Sylvia Osborn received her PhD from the University of Waterloo in 1978. Since 1977 she has been a faculty member in the Computer Science Department at the University of Western Ontario in London. Her research interests are in role-based access control, object-oriented databases, and database integration.

## 3 December 2001, 11:00 AM

 Title: Programmability in the SQL Server DBMS (PDF) Speaker: Jose Blakeley, Microsoft Corp. Abstract: The SQL Server product is broadening the set of programming languages that can be used by database developers to write business logic in the form of functions, procedures, triggers, and types. A key component of this work is hosting the .NET Common Language Runtime inside the SQL Server process. This talk will provide an overview of the relevant design decisions in hosting the .NET Runtime and their impact on performance, scalability, security, and robustness of the server. I will also describe the new features, such as functions and types, exposed through the SQL language. Bio: José Blakeley joined Microsoft in 1994. He served as an architect for the OLE DB data access interfaces during 1995-1998. He is currently an architect in the SQL Server product working on server-side programmability and extensibility. José has authored several conference and journal articles and book chapters on design aspects of relational and object database management systems, and data access. Before joining Microsoft, José was a member of the technical staff with Texas Instruments where he developed an object database management system. He received a Ph.D. in Computer Science from the University of Waterloo in 1987.

## 1 February 2002, 10:00 AM (Please note the special date and time)

 Title: W3C XML Query WG: A Status Report (PDF) Speaker: Paul Cotton, Microsoft Canada Abstract: The World Wide Web Consortium (W3C) XML Query Working Group [1] was chartered in September 1999 to develop a query language for XML [2] documents. The goal of the XML Query Working Group is to produce a formal data model for XML documents with Namespaces [3] based on the XML Infoset [4] and XML Schemas [5, 6, 7], a formal semantics for a set of query operators on that data model, and then a query language with a concrete canonical syntax based on the proposed operators. The WG recently produced a revision of the XQuery specifications [8-11] which are now aligned with the first draft of XPath 2.0 [12]. This talk will provide an update on the current status of XQuery and XPath. This update will include current information on the status of the XML Query Data Model, Formal Semantics, and the XQuery and XPath languages. The talk will also include a list of the important issues and questions still in front of the XML Query WG. Bio: Paul Cotton joined Microsoft in May, 2000 and is currently Program Manager of XML Standards. Paul telecommutes to his Redmond job from his home in Ottawa, Canada. Paul has been participating in the W3C XML Activity since mid-1998 when he became IBM's prime representative on the XML Linking and Infoset Working Groups. Paul has been chairperson of the XML Query Working Group and a member of the XML Coordination Group since September 1999. Paul was elected to the W3C Advisory Board in June 2000 and in December 2001 was elected to the first W3C Technical Architecture Group. Paul is also Microsoft's alternate on the XML Protocol Working Group which is working on SOAP 1.2. Paul graduated with a M.Math in Computer Science from the University of Waterloo in 1974.

## 11 February 2002, 11:30 AM (Please note the time change)

 Title: Automatically Tuning DBMS Buffer Pools Speaker: Pat Martin, Queens University Abstract: The tasks of configuring and tuning large database management systems (DBMSs) have always been both complex and time-consuming. They require knowledge of the characteristics of the system, the data, and the workload. The increasing diversity of the data and the workloads handled by today's systems is making manual tuning by database administrators almost impossible. Self-managing DBMSs attempt to solve this problem by shifting responsibility for tuning and configuration onto the system itself. In this talk I will first present a self-tuning algorithm, called the Dynamic Reconfiguration algorithm (DRF), for managing the buffer pools in a DBMS and discuss the results of a set of experiments to investigate the performance of the algorithm. I will then propose a set of principles for a self-managing DBMSs and suggest directions for moving towards the goal of self-management. Bio: Patrick Martin is a Professor in the Department of Computing and Information Science at Queen's University and a Visiting Scientist with the Centre for Advanced Studies at the IBM Toronto Laboratory. His research interests include self-managing database systems and web caching.

## 11 March 2002, 11:00 AM

 Title: Timber: A Native XML Database Management System (PDF) Speaker: H.V. Jagadish, University of Michigan Abstract: The eXtensible Markup Language (XML) has recently become very popular as a representation format for a wide variety of data. Large repositories of XML data have begun to emerge. The effective management of XML in a database thus becomes a pressing issue. A central challenge in this regard is the complex and heterogeneous structure of XML data. In this talk, I will discuss the design of Timber, a native XML database management system that we are building at the University of Michigan. Bio: H. V. Jagadish is a Professor of Computer Science and Engineering at the University of Michigan in Ann Arbor. After earning his PhD from Stanford in 1985, he spent over a decade at AT&T Bell Laboratories in Murray Hill, N.J., eventually becoming head of AT&T Labs database research department at the Shannon Laboratory in Florham Park, N.J. He has also served as a Professor at the University of Illinois in Urbana-Champaign. Professor Jagadish is well-known for his broad-ranging research on databases, and has over 80 major papers and 20 patents. He is currently the founding editor of the ACM SIGMOD Digital Review. Among many professional positions he has held, he has previously been an Associate Editor for the ACM Transactions on Database Systems (1992-1995) and Program Chair of the ACM SIGMOD annual conference (1996).

## 8 April 2002, 11:00 AM

 Title: Clustering by Impact: Scalable, Incremental Clustering of Data Streams (PDF) Speaker: Daniel Barbara, George Mason University Abstract: As organizations tend to store and analyze more and more data, the need to design and implement data mining algorithms that are incremental and can deal with continuous data streams is pressing. In this talk we present two clustering algorithms that meet that need. They cluster new points by measuring the impact these points have on existing cluster's properties. In particular, Fractal Clustering deals with numerical domains and uses the impact over the fractal dimension of the existing clusters to decide where to place a new point. Its counterpart, Entropy Clustering, is designed to deal with categorical domains and it uses the impact over the entropy of the existing clusters to perform clustering. We will show results of using these techniques over real and synthetic data sets, as well as show the techniques' relationship to the well-known principle of Minimum Description Length. We will also show how these techniques are well-suited to track the evolution of clusters in data streams, due both to the concise representations of the clustering models they produce, and incremental nature of the algorithms. Bio: Daniel Barbara is an Associate Professor in the Information and Software Engineering Department, George Mason University. He got his Ph.D. in Computer Science at Princeton University in 1985, and has held positions in Panasonic Laboratories and Bell Communication Research as a Senior Scientist. He is the author of over 60 publications in international conferences and research journals. Dr. Barbara's current interest are in the areas of Data Warehousing and Data Mining.

## 29 April 2002, 11:00 AM

 Title: Generic Model Management - A Database Infrastructure for Schema Manipulation (PDF) Speaker: Philip Bernstein, Microsoft Research Abstract: Despite 30 years of research on database support for engineering applications, such applications remain complicated and hard to build. To improve this situation by an order of magnitude, a much higher level API is needed. We present such an interface, called Model Management. Its objects are models and mappings. By "model," we mean a complex structure that represents a design artifact, such as a relational schema, XML schema, object-oriented interface, UML model, web-site map, or software configuration. By "mapping," we mean an explicit representation of connections or transformations between two models. The main operations of Model Management are match, merge, diff, and compose. We explain how these operations can be used to solve classical meta data management problems and sketch a system architecture to implement them. Bio: Phil Bernstein is a researcher at Microsoft Corporation. Over the past 25 years, he has been a product architect at Microsoft and at Digital Equipment Corp., a professor at Harvard University and Wang Institute of Graduate Studies, and a VP Software at Sequoia Systems. During that time, he has published over 100 articles on the theory and implementation of database systems, and coauthored three books, the latest of which is Principles of Transaction Processing for the System Professional (Morgan Kaufmann, 1997). He holds a B.S. from Cornell University and a Ph.D. from University of Toronto. A summary of his current research on meta data management can be found at http://www.research.microsoft.com/~philbe.

## 10 June 2002, 11:00 AM

 Title: Location Management in Moving Objects Databases Speaker: Ouri Wolfson, University of Illinois at Chicago Abstract: Consider applications of a database that models information about moving objects and their location. For example, given a database representing the location of taxi-cabs, a typical query may be: retrieve the free cabs that are currently within 1 mile of a customer at 33 N. Michigan Ave.. Military applications utilizing moving objects databases arise in the context of the digital battlefield, and civilian ones arise in transportation systems and in systems that track mobile computers for providing context awareness. Currently, moving objects database applications are being developed in an ad hoc fashion. Database Management System (DBMS) technology provides a potential foundation upon which to build these applications, however, there is a critical set of needed capabilities that are lacking in existing DBMS's. These include support for continuously changing data, for integrated spatial and temporal information, and for uncertainty management. The objective of our DOMINO project is to build an envelope containing these capabilities on top of existing DBMS's. In this talk I will describe the problems addressed by the project, and our proposed solutions. Bio: Ouri Wolfson's main research interests are in database systems, distributed systems, transaction processing, and mobile computing. He received his B.A. degree in mathematics, and his Ph.D. degree in computer science from Courant Institute of Mathematical Sciences, New York University. He is currently a Professor in the Department of Computer Science at the University of Illinois at Chicago, where he directs the Databases and Mobile Computing Laboratory, and the newly established Mobile Information Systems Research Center. He served as a consultant to Argonne National Laboratory, to the US Army Research Laboratories, and to the Center of Excellence in Space Data and Information Sciences at NASA. He is also the founder of Mobitrac, a high-tech startup company specializing in infrastructure software for location based services and products. Before joining the University of Illinois he has been on the computer science faculty at the Technion and Columbia University, and he has been a Member of Technical Staff at Bell Laboratories. Ouri Wolfson authored over ninety publications in leading journals and conference proceedings. He is a Fellow of the Association of Computing Machinery, an editor of the ACM/URSI/Baltzer Wireless Networks Journal, a Member of the ACM SIGMOD Digital Review Editorial Board and a guest editor of the ACM/Baltzer Journal on Special Topics in Mobile Networks. He is also the 2001 recipient of the UIC College of Engineering Faculty Research Award. He is currently serving as a National Lecturer for the Association of Computing Machinery professional society. He participated in numerous conferences (including ACM-SIGMOD, VLDB, PODS, ICDE, NGITS, ICDCS, MOBIDATA, DOOD, SSD, GIS, PDIS, CIKM) as a program committee member, keynote speaker, session chairman, and panelist. Most recently he was the keynote speaker at the Second International Conference on Mobile Data Management (MDM '2001), and is the program committee co-chair of the Third International Conference on Mobile Data Management (MDM 2002). He was also the General co-Chairman of the IEEE Knowledge and Data Engineering Exchange Workshop, and he serves on the Advisory Committee of the NSF Center of Research Excellence in Science and Technology, at Florida A&M University. His work has been funded by the National Science Foundation, Air Force Office of Scientific Research, Defense Advanced Research Projects Agency, NATO, US Army, the New York State Science and Technology Foundation, Hughes Research Laboratories, and Informix Co.

## 8 July 2002, 11:00 AM

 Title: Caterpillars, T-Graphs and Context Speaker: Derick Wood, Hong Kong University of Science and Technology Abstract: I will present two different ways that we developed for specifying context in documents. The first, caterpillar expressions, leads to very nice theoretical questions/problems; the second, T-graphs, supports a 70% solution that is suitable for most contexts that we need (it was used in Designer). I will compare the two methods and summarize their positive and negative aspects. Bio: Professor Wood received his BSc and PhD degrees from the University of Leeds, England, in 1963 and 1968, respectively. He was a Postdoctoral Fellow at the Courant Institute, New York University, from 1968 to 1970 before joining the Unit of Computer Science at McMaster University in 1970. He was Chair of Computer Science from 1979 to 1982. From 1982 to 1992 he was a Professor in the Department of Computer Science, University of Waterloo. For three years he served as Director of the Data Structuring Group. Before joining HKUST in 1995, he was a Professor in the Department of Computer Science, University of Western Ontario. He has published widely in a number of research areas and has written two textbooks, "Theory of Computation," published by John Wiley, and "Data Structures, Algorithms, and Performance," published by Addison-Wesley.

## 2 August 2002, 11:00 AM

 Title: Robust Space Transformations for Distance-based Operations (PDF) Speaker: Raymond Ng, University of British Columbia Abstract: For many database operations, such as nearest neighbor search, distance-based clustering and outlier detection, there is an underlying $k$-D data space in which each tuple/object is represented as a point in the space. We observe that in the presence of variability, correlation, outliers and/or differing scales, we may get unintuitive results if an inappropriate space is used. The fundamental question addressed in this paper is: What then is an appropriate space?''. We propose using a robust space transformation called the Donoho-Stahel estimator. In the first half of the paper, we show the key properties of the estimator. Of particular importance to database applications is the stability property, which says that in spite of frequent updates, the estimator does not (a) change much, (b) lose its usefulness, or (c) require re-computation. In the second half, we focus on the computation of the estimator for high-dimensional databases. We develop randomized algorithms and evaluate how well they perform empirically. The bottom-line is that the Donoho-Stahel transformation, which possesses desirable properties for database operations, can be computed efficiently as well. Bio: Raymond Ng received his B.Sc.(Hons) degree in Computer Science from the University of British Columbia in 1984, and his M.Math. degree in Computer Science from the University of Waterloo in 1986, and his Ph.D. degree in Computer Science from the University of Maryland, College Park, in 1992. He is a full professor at the University of British Columbia. His areas of research include data mining, bioinformatics, image databases, and multimedia systems. He has published numerous conference and journal papers on these topics. He is one of the associate editors for the VLDB journal and the IEEE TKDE journal. He is a program co-chair of the 2002 SIGKDD Conference, and has served as a member of program committees for many premier conferences, including ACM SIGMOD, VLDB, SIGKDD and ACM PODS.