Database Seminar Series (2003-2004) | Data Systems Group

The Database Seminar Series provides a forum for presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks that are scheduled for 2003-2004 are below, and more will be listed as we get confirmations. Please send your suggestions to M. Tamer Özsu.

Unless otherwise noted, all talks will be in room DC (Davis Centre) 1304. Coffee will be served 30 minutes before the talk.

We will try to post the presentation notes, whenever that is possible. Please click on the presentation title to access these notes (usually in pdf format).

Database Seminar Series is supported by iAnywhere Solutions, A Sybase Company.

Divesh Srivastava

Louiqa Raschid

Elke Rundensteiner

Klemens Böhm

Mario Nascimento

Z. Meral Özsoyoglu

Anastassia Ailamaki

Nick Koudas

Ken Ross

22 September 2003, 11:00 AM

Title:	Phrase Matching in XML (PDF)
Speaker:	Divesh Srivastava, AT&T Labs-Research
Abstract:	Phrase matching is a common IR technique to search text and identify relevant documents in a document collection. Phrase matching in XML presents new challenges as text may be interleaved with arbitrary markup, thwarting search techniques that require strict contiguity or close proximity of keywords. We present a technique for phrase matching in XML that permits dynamic specification of both the phrase to be matched and the markup that should be ignored. We develop an efficient algorithm for our technique that utilizes inverted indices on phrase words and XML tags. We describe experimental results comparing our algorithm to an indexed-nested loop algorithm that illustrate our algorithm's effectiveness. Finally, we show a demo of the PIX system prototype that we built for phrase matching in XML. This is joint work with Sihem Amer-Yahia, Mary Fernandez, and Yu Xu.
Bio:	Divesh Srivastava is the head of the Database Research Department at AT&T Labs-Research. He received his B.Tech. in Computer Science & Engineering from the Indian Institute of Technology, Bombay, India, and his Ph.D. in Computer Sciences from the University of Wisconsin, Madison, USA. His current research interests include XML databases and IP network data management.

20 October 2003, 11:00 AM

Title:	Efficient Exploration of Links and Paths in Life Sciences Data Sources
Speaker:	Louiqa Raschid, University of Maryland
Abstract:	Life science data sources contain data on classes of scientific entities such as genes and sequences. Each source may have data on one or more classes. There is significant diversity in the contents of sources, links between sources, overlap among sources, and the computational capabilities of these sources. In this talk, I will summarize two approaches to address the efficient exploration of sources. The first involves algorithms to explore the search space of links and paths between biological data sources, and to efficiently identify paths that satisfy a navigational query (regular expression). The second is to develop a framework to determine the properties that characterize (multiple alternate) links and paths between two sources. I will then define two problems related to optimization. The task is to determine a path that satisfies a query, has a high ranking (with respect to some desired semantic criteria) and can be evaluated by a plan with a low evaluation cost. The second task is to choose a subset of paths so as to maximize the benefit (with respect to the semantic criteria subject to some upper bound on the evaluation cost. Joint work with Zoe Lacroix, Felix Naumann and Maria Esther Vidal. If there is time, I will also talk about the following: Latency Profiles: Performance Monitoring for Wide Area Applications Recent technological advances have enabled the deployment of wide area applications against Internet accessible sources. A performance challenge to applications in such a setting is the unpredictable end-to-end latency of accessing these sources. We use passive information gathering mechanisms to learn end-to-end latency distributions and construct Latency Profiles (LPs). We hypothesize that a group of clients, within an autonomous system (AS), that are accessing a content server, in another AS, may be represented by (one or more) LPs. Related networking research on IDMaps, points of congestion, and BGP routes support such hypothesis. We develop aggregate LPs to provide coverage of groups (clusters) of client-server pairs. Using data gathered from a (limited) experiment we demonstrate the feasibility of constructing LPs and aggregate LPs. Joint work with Avigdor Gal, Vladimir Zadorozhny and Hui-Fang Wen
Bio:	Louiqa Raschid received a Bachelor of Technology in electrical engineering from the Indian Institute of Technology, Madras, in 1980, and a Ph.D. in electrical engineering from the University of Florida, Gainesville, in 1987. She is a Professor in the Smith School of Business at the University of Maryland and holds a joint appointment with the Institute for Advanced Computer Studies and the Department of Computer Science. Dr. Raschid's research interests include the following: Scalable architectures for wide area computation with heterogeneous information servers, e.g., scientific discovery from biomolecular data sources; Cost models and query planning and optimization techniques for dynamic wide-area environments; Profile (utility) based tuning of data delivery to clients in fixed and mobile networks; Publishing and locating sources based on quality and content metadata using the WWW and XML. Her research has been published in high quality conferences (ACM SIGMOD, VLDB, IEEE ICDE and ICDCS, EDBT) and journals (ACM TODS, IEEE TKDE, IEEE ToC, the Journal of Logic Programming, the VLDB Journal and the Journal of Parallel and Distributed Computing). Her research is supported by grants from the National Science Foundation and the Defense Advanced Research Projects Agency. She was recognized by the Vice President of Research at the University of Maryland to be among the Top 100 Faculty in obtaining research funding. Dr. Raschid serves on the editorial board of the IEEE Transactions on Knowledge and Data Engineering, the Journal of Very Large Data Bases, and the INFORMS Journal of Computing. She has served on the Program Committee of all the major database conferences (ACM Sigmod, VLDB and ICDE), the WWW Conference, IEEE Advances in Digital Libraries, and the International Conference on Information Systems.

26 November 2003, 10:30 AM (Please note the changed date and time)

Title:	Raindrop: The WPI Stream Query Engine (PDF)
Speaker:	Elke Rundensteiner, Worcester Polytechnic Institute
Abstract:	Many modern applications process queries over unbounded streams of data, including intrusion detection in networks, monitoring remote sensors, and monitoring patients vital signs. These data streams arrive in real time, are unbounded in length and have unpredictable arrival patterns due to external uncontrollable factors such as network congestion or weather in the case of remote sensors. In this talk, I'll describe the recent research and system development activities we have been undergoing at Worcester Polytechnic Institute in this emerging area of stream query engines. One, I'll describe the novel approach we are taking towards processing XQueries over XML streams. XML stream applications bring the challenge of efficiently processing queries on sequentially accessible token-based data. While the automata model is naturally suited for pattern matching on tokenized XML streams, the algebraic model in contrast is a well-established technique for set-oriented processing of self-contained tuples. However, neither automata nor algebraic models are well-equipped to handle both computation paradigms. Our Raindrop engine thus accommodates these two paradigms within one algebraic framework to take advantage of both. In our query model, both tokenized data and self-contained tuples are supported in a uniform manner. Query plans can be flexibly rewritten using equivalence rules to change what computation is done using tokenized data versus tuples. The necessity of such a uniform plan model are been illustrated using an experimental study. Two, we have designed a novel technique for adapting the execution of stream queries that, to my knowledge, is not present in any other continuous query system to date. Namely, we note that utilizing a single scheduling algorithm to execute a continuous query, as is employed in most current continuous query systems, is not adequate because no existing scheduling algorithms can optimally meet an arbitrary set of Quality of Service (QoS) requirements. Therefore, an adaptive strategy selector guidable by QoS requirements was developed to meet arbitrary Quality of Service (QoS) requirements. The adaptive strategy selector monitors the effects of its behavior on its environment through a feedback mechanism, with the aim of exploiting previously beneficial behavior and exploring alternative behavior. The feedback mechanism is guided by qualitatively comparing how well each algorithm has met the QoS requirements. Then the next scheduling algorithm is chosen by spinning a roulette wheel where each candidate is chosen with a probability equal to its performance score. The adaptive algorithm was shown to be effective in outperforming single scheduling algorithms for many QoS combinations and stream arrival patterns. Three, we also have been studying methods for online plan reoptimization and plan migration so to online modify the structure of the query plan. Dynamic query plan migration is concerned with the on-the-fly transition from one continuous query plan to a semantically equivalent yet more efficient query plan. In particular, we have designed two alternate online plan migration strategies for continuous queries over data streams for plans with window-based join operators. The first strategy is called "moving state strategy" and the second is called "parallel track strategy". In the moving state strategy, the query execution is paused and all tuples in intermediate queues are drained before switching to a new plan. In the parallel track strategy, the new plan is plugged and the system executes both plans in parallel until eventually disconnecting the old plan. Pros and Cons of these techniques are discussed, and an experimental study exposing the trade-offs between these techniques will be analyzed.
Bio:	Elke A. Rundensteiner is an Associate professor of the Department of Computer Science at the Worcester Polytechnic Institute, after having been a faculty member at the University of Michigan, Ann Arbor. She has received a BS degree (Vordiplom) from the J. W. Goethe University, Frankfurt, West Germany, an M.S. degree from Florida State University, and a Ph.D. degree from the University of California, Irvine, all in Computer Science. Dr. Rundensteiner has been active in the database and information systems research community for over 15 years. Her current research interests include data integration and migration, XML and web data management, data warehousing for distributed systems, continuous query processing, and large-scale information visualization. She has more than 100 publications in these and related areas. Her research has been funded by government agencies including NSF and industry like Verizon Labs, GTE, and others. Dr. Rundensteiner has received numerous honors and awards, including Fulbright, NSF Young Investigator, and IBM Partnership Award. Elke Rundensteiner has served on program committees of numereous conferences in the database field, including ACM SIGMOD, VLDB, and IEEE Data Engineering. She has been an editor of several journals, including Associate Editor of the IEEE Transactions on Data and Knowledge Engineering Journal.

15 December 2004, 11:00 AM

Title:	Incentives and Sanctions in Peer-to-Peer Information Systems
Speaker:	Klemens Böhm, Otto-von-Guericke-Universität Magdeburg
Abstract:	Modern information systems require that users provide resources. These resources may be on the physical level, or they may be intellectual input. But only in rare cases, incentives to provide these resources exist. Our research objective is the development of decentralized mechanisms to stabilize cooperative behavior in Peer-to-Peer systems and to reduce destructive behavior and free riding. This talk will focus on these issues in the context of Content-Addressable Networks (CAN). CAN administer huge sets of (key, value)-pairs and cope with very high workloads. They consist of many autonomous nodes. However, this also means that nodes may be uncooperative, i.e., not carrying out their share of the work while trying to benefit from the network. This talk deals with this kind of adverse behavior in CAN. It is challenging for large networks of more than 100,000 nodes to design a forwarding protocol for CAN that bypasses and excludes uncooperative nodes. We have designed and evaluated such a CAN protocol that is based on reputation, with the following characteristics: It establishes logical trust networks within the CAN. Nodes give positive feedback on peers that have performed useful work. Feedback is distributed in a swarm-like fashion and expires after a certain amount of time. In extreme situations, the CAN asks nodes to perform a proof of work, in order to deter uncooperative behavior. Results of experiments with 100,000 peers are positive. Finally, we will address other scenarios where our main concern (decentralized mechanisms to ensure cooperative behavior) is important as well.
Bio:	Klemens Böhm is a professor (Associate Professor) at Otto-von-Guericke-Universität Magdeburg, Germany, since 2002. Prior to that, he has served as a research assistant in the database research group of Swiss Federal Institute of Technology (ETH), Zurich. Klemens received his Ph.D. degree and his diploma in computer science from Technical University of Darmstadt in 1997 and 1993, respectively. His current research interests are Peer-to-Peer information systems and cooperative work.

12 January 2004, 11:00 AM

Title:	Effective and Efficient Region-based Image Retrieval (PDF)
Speaker:	Mario Nascimento, University of Alberta
Abstract:	Content Based Image Retrieval (CBIR) is a challenging task. Current research works attempt to obtain and use the semantics of image to perform better retrieval. Towards this goal, segmentation of an image into regions has been advocated, since local properties of regions can help matching objects between images and thereby contribute towards a more effective CBIR. In this talk I will introduce three variations of a CBIR technique we proposed recently, called SNL. In SNL each image is segmented and features including the color, shape, size and spatial position of the obtained regions are extracted. Regions are then compared using the Integrated Region Matching (IRM) distance measure, which, not being a metric distance, prevents the use of metric access structures. We overcome this issue, by using MiCRoM, a true metric distance to compare segmented images. The resulting approach, which we call SNL, can be used in conjunction with a filtering technique to reduce substantially the number of images compared, hence speeding up query processing. Albeit metric-based, SNL is computationally expensive. We address this drawback in the SNL+ approach, where we replace the expensive metric distance in SNL* by the inexpensive (but non-metric) IRM distance. We found that one can still make use of the same filtering technique, at the expense of little loss in retrieval effectiveness. Thus, the main contribution we offer is SNL+, an effective and efficient region-based image retrieval technique. The talk is based on the paper: Mario A. Nascimento, Veena Sridhar, Xiaobo Li: Effective and efficient region-based image retrieval. J. Vis. Lang. Comput. 14(2): 151-179 (2003).
Bio:	A. Nascimento obtained his Ph.D. degree in Computer Science at Southern Methodist University's School of Engineering in 1996. Between 1989 and 1999 he was a researcher with the Brazilian Agency for Agricultural Research (Information Technology Center) and, between 1997 and 1999, he was also associated with the Institute of Computing of the State University of Campinas (Brazil). Since then he has been with the Department of Computing Science of the University of Alberta. He has published over fifty papers in international conferences, journals and workshops, has served or is serving as Program Committee member of several conferences and as Program Chair or Co-chair for ACM MM 2001 MIR, IDEAS'02 SPIRE'2003, and as a member of the ACM SIGMOD DiSC Editorial Board (2001 and 2002). He is currently serving as a member of ACM SIGMOD's Executive Board (as its Information Director). His main research interests lie in the areas of Content-based Image Retrieval and Access Structures for (Spatio/Temporal/Image) Databases He is a member of ACM, SIGMOD, SIGIR, and IEEE Computer Society. Further and current information can be found at http://www.cs.ualberta.ca/~mn.

16 February 2004, 11:30 AM (Please note special time)

Title:	Distance Based Indexing and Similarity Search for Sequences (PDF)
Speaker:	Z. Meral Özsoyoglu, Case Western Reserve University
Abstract:	Finding sequences similar to a given query sequence in a large collection of sequences is a fundamental problem in many database applications including, computational genomics, computational finance, image and text processing. The similarity between sequences is defined in terms of a distance function determined by the application domain. In this work, we consider sequence proximity search in computational genomics, where sequence similarity is usually an indication of an evolutionary relationship between DNA and protein sequences, and usually indicates functional similarity. The most popular distance measures are based on (a weighted) count of character edit or block edit operations to transform one string to another. The main goal is to develop efficient near neighbor search tools that work for both character edit and block edit distances. Our premise is that the Distance Based Indexing techniques, which are originally developed for metric distances can be modified for sequence distance measures provided that they are almost metrics. We first show that sequence distance functions of interest (compression distance and weighted character edit distance) are almost metric. We then show how to modify distance based index structures vantage point trees to accommodate almost metric distances. We test our theoretical results on protein sequences and synthetic data sets.
Bio:	Z. Meral Özsoyoglu is a professor of Computer Science, and Associate Chair for Division of Computer Science at EECS department, Case Western Reserve University, Cleveland, Ohio. Prof. Ozsoyoglu's primary work and research interests are in the areas of query languages and query processing, data models, and index structures in object oriented databases, multimedia and temporal databases. More recently, she has been working in Bioinformatics, more specifically on managing, visualizing and querying genomic pathways, and efficient access methods for genomic sequences. She is a founding member of CWRU Center for Computational Genomics. Dr. Ozsoyoglu has published over 60 papers in computer science journals including ACM TODS, IEEE TKDE, IEEE TSE, JCSS, and conferences including ACM SIGMOD, ACM PODS, VLDB, and IEEE ICDE. She is currently an associate editor of ACM TODS, IEEE Data Engineering Bulletin, and WWW Journal.

15 March 2004, 11:00 AM

Title:	FATES: Decoupling Main Memory Data Layout from Storage Organization
Speaker:	Anastassia Ailamaki, Carnegie Mellon University
Abstract:	Current database systems use data layouts that can exploit unique features of only one level of the memory hierarchy (cache/main memory or on-line storage). Such layouts optimize for the predominant access pattern of one workload (e.g., DSS), while trading off performance of another workload type (e.g., OLTP). Achieving efficient execution of different workloads without such trade-off or the need to manually re-tune the system for each workload type is still an unsolved problem. In addition, current database workloads are often compound (i.e., consisting of both DSS and OLTP requests). Such workloads are difficult to optimize and execute, because optimization assumptions rarely hold at execution time. In this talk, I will present the design of a multi-layer storage and main memory management system for relational databases called Fates, that achieves efficient execution of compound database workloads at all levels of the memory hierarchy. By leveraging unique characteristics of devices at each level, Fates can automatically match query access patterns to the respective device characteristics, which eliminates the difficult and error-prone task of manual performance tuning. Borrowing from the Greek mythology of The Three Fates-Clotho, Lachesis, and Atropos-who spin, measure, and cut the thread of life, the three components of our database system (bearing the Fates' respective names) establish proper abstractions in the database query execution engine. The talk will describe all three components and will show experimental results.
Bio:	Anastassia Ailamaki received a B.Sc. degree in Computer Engineering (1990) from the Polytechnic School of the University of Patra, Greece, M.Sc. degrees from the Technical University of Crete, Greece and from the University of Rochester, NY, and a Ph.D. degree in Computer Science from the University of Wisconsin-Madison. She is currently an Assistant Professor at the School of Computer Science at Carnegie Mellon University. She has received two best-paper awards (VLDB 2001 and Performance 2002) and an NSF CAREER award (2002). Her current research interests include database system design and performance, architecture-conscious databases, internet querying and caching, and workload characterization.

19 April 2004, 11:30 AM (Please note special time)

Title:	Flexible String Matching in Databases
Speaker:	Nick Koudas, University of Toronto
Abstract:	In this talk I will provide an overview of our work on flexible string matching in databases for data cleaning applications. I will show how it is possible to embedded popular metrics that quantify string closeness into an SQL framework and express them in a declarative form. The end result is a set of techniques that can assist in identifying pairs of strings that are 'similar' (with respect to those metrics) directly, using SQL on any relational database.
Bio:	Nick Koudas received his PhD from the university of toronto, his M.Sc from the University of Maryland at College park and his B.Sc from the university of Patras. He is an associate editor for Information Systems and IEEE TKDE and served in the past as a SIGMOD online associate information editor. His interests are in database systems, metadata management, network data management.

7 June 2004, 11:00 AM

Title:	Architecture Sensitive Design of Database Engines
Speaker:	Ken Ross, Columbia University
Abstract:	With modern machines having large main memories, database system performance has become less dependent on I/O characteristics. Current performance bottlenecks include memory latency (for both data and instructions) and branch misprediction delays. At the same time, commodity processors provide features (such as multimedia instruction sets like MMX and SSE) that can potentially enhance the performance of data intensive operations. In this talk, I will give an overview of how our group has addressed these issues in designing algorithms and data structures for database operations. I will then explore in more depth how buffering can be helpful, in several contexts, for avoiding cache misses. Buffering can improve throughput while still providing certain response-time guarantees. This talk represents joint work with Jingren Zhou.
Bio:	Professor Ross is an associate professor in the Computer Science Department at Columbia University in New York City. His research interests touch on various aspects of database systems, including query processing, query language design, data warehousing, and architecture-sensitive database system design. Professor Ross received his PhD from Stanford University. He has received several awards, including a Packard Foundation Fellowship, a Sloan Foundation Fellowship, and an NSF Young Investigator award.