The Database Seminar Series provides a forum for presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks that are scheduled for 2005-2006 are below, and more will be listed as we get confirmations. Please send your suggestions to M. Tamer Özsu.
Unless otherwise noted, all talks will be in room DC (Davis Centre) 1304. Coffee will be served 30 minutes before the talk.
We will try to post the presentation notes whenever possible. Please click on the presentation title to access these notes (usually in PDF format).
Database Seminar Series is supported by iAnywhere Solutions, A Sybase Company.
|Title:||Building a MetaQuerier and Beyond: A Trilogy of Search, Integration, and Mining for Web Information Access|
|Speaker:||Kevin Chang, University of Illinois, Urbana-Champaign|
While the Web has become the ultimate information repository, several major barriers have hindered today's search engines from unleashing the Web's promise. Toward tackling the dual challenges for accessing both the deep and the surface Web, I will present our "trilogy" of pursuit:
To begin with, from search to integration: As the Web has deepened dramatically, much information is now hidden on the "deep Web," behind the query interfaces of numerous searchable databases. Our 2004 survey estimated 450,000 online databases and 1,258,000 query interfaces. We thus believe that search must resort to integration: To enable access to the deep Web, we are building the MetaQuerier at UIUC for both finding and querying such online databases.
Further, from integration to mining: Toward large scale integration, to tackle the critical issue of dynamic semantics discovery, our key insight is that, while the deep Web challenges us with its large scale, the challenge itself presents a unique opportunity: We believe that integration must resort to mining, to tackle the deep semantics by holistically exploring shallow syntactic and statistical regularities hidden across a large scale of sources.
Finally, from mining back to search? Beyond the MetaQuerier, such holistic mining is equally crucial for the dual challenge of semantics discovery on the surface Web. We believe such mining must resort to search, and propose to build holistic analysis into a next generation search engine by demonstrating our initial solutions.
Project URL: http://metaquerier.cs.uiuc.edu
|Bio:||Kevin Chen-Chuan Chang is an Assistant Professor in the Department of Computer Science, University of Illinois at Urbana-Champaign. He received a PhD in Electrical Engineering in 2001 from Stanford University. His research interests are in large scale information access, with emphasis on Web information integration and top-k ranked query processing. He is the recipient of an NSF CAREER Award in 2002, an NCSA Faculty Fellow Award in 2003, and IBM Faculty Awards in 2004 and 2005. URL:
|Title:||SMaestro: Second Generation Storage Infrastructure Management|
|Speaker:||Kaladhar Voruganti, IBM Almaden Research Center|
|Abstract:||Storage management has now become the largest component of the overall cost of owning storage subsystems. One of the key reasons for this high cost is the limit on the amount of storage that can be managed by a single system administrator. This limit is due to the set of complex storage management tasks that a system administrator has to perform, such as storage provisioning, performance bottleneck evaluation, planning for future growth, backup/restore, security violation analysis, and interaction with application, network and database system administrators. Thus, many storage vendors have introduced storage management tools that try to increase the amount of storage that can be managed by a single system administrator by automating many of these tasks. However, most of these existing storage management products can generally be classified as first generation products that provide basic monitoring and workflow based action support. These tools generally lack analysis and planning functionality. The objective of this talk is to present the trends in the planning and analysis area of storage management with specific emphasis on open research problems.|
|Bio:||Kaladhar Voruganti received his BSc in Computer Engineering and PhD in Computing Science from the University of Alberta in Canada. For the past six years he has been working as a research staff member at the IBM Almaden Research lab in San Jose, California. He is currently leading a multi-site research team that is working on storage management planning tools. Kaladhar has received an Outstanding Technical Achievement award for his contributions to the IBM iSCSI storage controller, and another Outstanding Technical Achievement award for his contributions to IBM storage management products. The IBM iSCSI target controller received the most innovative product award at the Storage 2001 and Interop 2001 conferences. In the past Kaladhar has published in leading database conferences. Currently he is actively publishing in leading storage systems conferences and has received three IBM Bravo awards for his publication efforts.|
|Title:||Learning in Query Optimization|
|Speaker:||Volker Markl, IBM Almaden Research Center|
|Abstract:||Database systems let users specify queries in a declarative language like SQL. Most modern DBMS optimizers rely upon a cost model to choose the best query execution plan (QEP) for any given query. Cost estimates are heavily dependent upon the optimizer's estimates for the number of rows that will result at each step of the QEP for complex queries involving many predicates and/or operations. These estimates, in turn, rely upon statistics on the database and modeling assumptions that may or may not be true for a given database. In the first part of our talk, we present research on learning in query optimization that has been carried out at the IBM Almaden Research Center. We introduce LEO, DB2's LEarning Optimizer, as a comprehensive way to repair incorrect statistics and cardinality estimates of a query execution plan. By monitoring executed queries, LEO compares the optimizer's estimates with actuals at each step in a QEP, and computes adjustments to cost estimates and statistics that may be used during the current and future query optimizations. LEO introduces a feedback loop to query optimization that enhances the available information on the database where the most queries have occurred, allowing the optimizer to actually learn from its past mistakes.
In the second part of the talk, we describe how the knowledge gleaned by LEO is exploited consistently in a query optimizer, by adjusting the optimizer's model and by maximizing information entropy.
|Bio:||Dr. Markl has been working at IBM's Almaden Research Center in San Jose, USA since 2001, conducting research in query optimization, indexing, and self-managing databases. Volker Markl is spearheading the LEO project, an effort on autonomic computing with the goal to create a self-tuning optimizer for DB2 UDB. He also is the Almaden chair for the IBM Data Management Professional Interest Community (PIC).
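The feedback loop described in the abstract (compare estimated and actual row counts, then reuse the resulting adjustment in later optimizations) can be illustrated with a minimal sketch. This is not DB2's or LEO's implementation: the class `FeedbackStore`, its method names, and the use of a single ratio per predicate are all hypothetical simplifications chosen for illustration.

```python
class FeedbackStore:
    """Toy store of per-predicate adjustment factors (hypothetical design)."""

    def __init__(self):
        # Maps a predicate string to the ratio actual_rows / estimated_rows.
        self.adjust = {}

    def record(self, predicate, estimated_rows, actual_rows):
        """After execution, remember how far off the estimate was."""
        if estimated_rows > 0:
            self.adjust[predicate] = actual_rows / estimated_rows

    def corrected(self, predicate, estimated_rows):
        """During a later optimization, scale the raw estimate by the
        learned factor; predicates never seen before are left unchanged."""
        return estimated_rows * self.adjust.get(predicate, 1.0)


store = FeedbackStore()
# The optimizer guessed 50 rows, but execution produced 200.
store.record("price > 100", estimated_rows=50, actual_rows=200)
# A later raw estimate of 80 rows for the same predicate is scaled up.
later_estimate = store.corrected("price > 100", 80)
```

A single multiplicative factor per predicate is the simplest possible correction; it ignores correlations between predicates and changes in the underlying data over time.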
From January 1997 to December 2000, Dr. Markl worked for the Bavarian Research Center for Knowledge-Based Systems (FORWISS) in Munich, Germany as deputy research group manager, leading the MISTRAL and MDA projects, thereby cooperating with SAP AG, NEC, Hitachi, Teijin Systems Technology, GfK, and Microsoft Research. His MDA project, jointly with TransAction Software, developed the relational database management system TransBase HyperCube, which was awarded the European IST Prize 2001 by EUROCASE and the European Commission.
Dr. Markl also initiated and co-ordinated the EDITH EU IST project investigating the physical clustering of multiple hierarchies and its applications to GIS and Data Warehousing that now is being carried out by FORWISS and several partners from Germany, Italy, Greece, and Poland.
Volker Markl is a graduate of the Technische Universität München, where he earned a Master's degree in Computer Science in 1995. He completed his PhD in 1999 under the supervision of Rudolf Bayer. His dissertation on "Relational Query Processing Using a Multidimensional Access Technique" was honored "with distinction" by the German Computer Society (Gesellschaft für Informatik). He also earned a degree in Business Administration from the University of Hagen, Germany in 1995. Since 1996, Volker Markl has published more than 30 reviewed papers at prestigious scientific conferences and journals, filed more than 10 patents and has been an invited speaker at many universities and companies. Dr. Markl is a member of the German Computer Society (GI) as well as the Special Interest Group on Management of Data of the Association for Computing Machinery (ACM SIGMOD). He also serves as program committee member and reviewer for several international conferences and journals, including SIGMOD, ICDE, VLDB, TKDE, TODS, IS, and the Computer Journal. His main research interests are in autonomic computing, query processing, and query optimization, but also include applications like data warehousing, electronic commerce and pervasive computing.
Dr. Markl's earlier professional experience includes work as a software engineer for a virology laboratory, as part of his military service; as a lecturer for software-engineering courses at the University of Applied Sciences in Augsburg, Germany and for programming and communications at the Technische Universität München; and as a consultant for a forwarding agency. He was awarded a fellowship by Siemens AG, Munich and also worked as an international intern with Benefit Panel Services, Los Angeles.
|Title:||Approximate Joins: Concepts and Techniques (PDF)|
|Speaker:||Divesh Srivastava, AT&T Labs-Research|
The quality of the data residing in information repositories and databases gets degraded due to a multitude of reasons. In the presence of data quality errors, a central problem is to identify all pairs of entities (tuples) in two sets of entities that are approximately the same. This operation has been studied through the years and it is known under various names, including record linkage, entity identification, entity reconciliation and approximate join, to name a few. The objective of this talk is to provide an overview of key research results and techniques used for approximate joins.
This is joint work with Nick Koudas.
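The core operation named in the abstract above, finding all pairs of tuples from two sets that are approximately the same, can be illustrated with a small sketch. The choice of Jaccard similarity over character q-grams, the function names, and the threshold of 0.5 are assumptions made for illustration; they are one common option and are not taken from the talk itself.

```python
def qgrams(s, q=3):
    """Return the set of character q-grams of s, padded with '#'."""
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def approximate_join(left, right, threshold=0.5, q=3):
    """Naive nested-loop approximate join over two lists of strings.

    Returns (left_value, right_value, similarity) for every pair whose
    q-gram Jaccard similarity is at least `threshold`.
    """
    right_grams = [(r, qgrams(r, q)) for r in right]
    pairs = []
    for l in left:
        l_grams = qgrams(l, q)
        for r, r_grams in right_grams:
            sim = jaccard(l_grams, r_grams)
            if sim >= threshold:
                pairs.append((l, r, sim))
    return pairs


matches = approximate_join(["Jon Smith"], ["John Smith", "Mary Jones"])
```

The nested loop compares every pair and is quadratic in the input sizes, so it only serves to make the problem statement concrete.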
Divesh Srivastava is the head of the Database Research Department at AT&T Labs-Research. He received his B.Tech. in Computer Science & Engineering from the Indian Institute of Technology, Bombay, India, and his Ph.D. in Computer Sciences from the University of Wisconsin, Madison, USA. His current research interests include XML databases and IP network data management.
|Title:||The Role of Document Structure in Querying, Scoring and Evaluating XML Full-Text Search (PDF)|
|Speaker:||Sihem Amer-Yahia, AT&T Labs-Research|
A key benefit of XML is its ability to represent a mix of structured and text data. We discuss the interplay of structured information and keyword search in three aspects of XML search: query design, scoring methods and query evaluation. In query design, existing languages for XML evolved from simple keyword search to queries combining sophisticated conditions on structure à la XPath and XQuery and complex full-text search primitives, such as the use of ontologies and keyword proximity distance, à la XQuery Full-Text. In XML scoring, methods range from a pure IR tf*idf to approximating and scoring both structure and keyword conditions. In evaluating XML search, document structure has been used to identify meaningful XML fragments to be returned as answers to keyword queries and, is
This discussion is based on published and ongoing work between AT&T Labs and UBC, The U. of Toronto, Cornell U., Rutgers U., the U. of Waterloo and UCSD.
|Bio:||Sihem Amer-Yahia is a Senior Technical Specialist at AT&T Labs Research. She received her Ph.D. degree from the University of Paris XI-Orsay and INRIA. She has been working on various aspects related to XML query processing. More recently, she has focused on XML full-text search. Sihem is a co-editor of the XQuery Full-Text language specification and use cases published in September 2005 by the W3C Full-Text Task Force. She is the main developer of GalaTex, a conformance implementation of XQuery Full-Text.|
|Title:||MobiEyes: Distributed Processing of Moving Queries over Moving Objects (PDF)|
|Speaker:||Ling Liu, Georgia Institute of Technology|
With the growing popularity and availability of mobile communications, our ability to stay connected while on the move is becoming a reality rather than the science fiction it was just a decade ago. An important research challenge for modern location-based services is the scalable processing of location monitoring requests on a large collection of mobile objects. The centralized architecture, though studied extensively in the literature, would create intolerable performance problems as the number of mobile objects grows significantly.
In this talk, we present a distributed architecture and a suite of optimization techniques for scalable processing of continuously moving location queries. Moving location queries can be viewed as standing location tracking requests that continuously monitor the locations of mobile objects of interest and return a subset of mobile objects when certain conditions are met. We describe the design of a distributed location monitoring architecture through MobiEyes, a distributed real time location monitoring system in a mobile environment. The main idea behind the MobiEyes distributed architecture is to promote a careful partition of a real time location monitoring task into an optimal coordination of server-side processing and client-side processing. Such a partition allows the location of a moving object to be computed with a high degree of precision using a small number of location updates or no updates at all, thus providing highly scalable and more cost-effective location monitoring services. Concretely, the MobiEyes distributed architecture not only encourages a careful utilization of the rapidly growing computational power available at various mobile devices, such as cell phones, handhelds, and GPS devices, but also endorses a strong coordination agreement between the mobile objects and the server. Such an agreement supports varying location update rates for different mobile users at different times, and advocates the exploitation of location prediction and location inference to further constrain the resource/bandwidth consumption while maintaining satisfactory precision of location information. A set of optimization techniques is used to further limit the amount of computation to be handled by the mobile objects and enhance the overall performance and system utilization of MobiEyes. Important metrics to validate the proposed architecture and optimizations include messaging cost, server load, and amount of computation at individual mobile objects.
Our experimental results show that the MobiEyes approach can lead to significant savings in terms of server load and messaging cost when compared to solutions relying on central processing of location information at the server. If time permits, at the end of my talk, I will also give an overview of the location privacy protection in LBS.
|Bio:||Ling Liu is currently an associate professor at the College of Computing at Georgia Tech. She directs the research programs in Distributed Data Intensive Systems lab, examining research issues and technical challenges in building scalable and secure distributed data intensive systems. Her current research interests include performance, scalability, security and privacy issues in networked computing systems and applications, in particular, mobile location based services and distributed enterprise computing systems. She has published over 150 international journal and conference articles. She has served as a PC chair of several IEEE conferences, including the co-PC chair of IEEE 2006 International Conference on Data Engineering (ICDE 06), the vice chair of the Internet Computing track of the IEEE 2006 International Conference on Distributed Computing (ICDCS 06), and is on the editorial board of several international journals, including an associate editor of IEEE Transactions on Knowledge and Data Engineering (TKDE), International Journal of Very Large Databases (VLDBJ), and International Journal of Web Service Research. Most of Dr. Liu's recent research has been sponsored by NSF, DoE, DARPA, IBM, and HP.|
|Title:||Implementing XQuery 1.0: The Story of Galax (PDF)|
|Speaker:||Mary Fernández, AT&T Labs - Research|
XQuery 1.0 and its sister language XPath 2.0 have set a fire underneath database vendors and researchers alike. More than thirty commercial and research XQuery implementations are listed on the XML Query working group home page. Galax (www.galaxquery.org) is an open-source, general-purpose XQuery engine, designed to be complete, efficient, and extensible. During Galax's development, we have focused on each of these three requirements in turn, while never losing sight of the other two.
In this talk, I will describe how these requirements have impacted Galax's evolution and our own research interests. Along the way, I will show how Galax's architecture supports these three requirements.
Galax is joint work with Jérôme Siméon, IBM T.J. Watson Research Center.
|Bio:||Mary Fernandez is Principal Technical Staff at AT&T Labs - Research. Her research interests include data integration, Web-site implementation and management, domain-specific languages, and their interactions. She is a member of the W3C XML Query Language Working Group, co-editor of several of the XQuery W3C working drafts, and is a principal designer and implementor of Galax, a complete, open-source implementation of XQuery (www.galaxquery.org). Mary is also an associate editor of ACM Transactions on Database Systems and serves on the advisory board of MentorNet (www.mentornet.net), an e-mentoring network for women in engineering and science.|
|Title:||Discovering Interesting Subsets of Data in Cube Space|
|Speaker:||Raghu Ramakrishnan, University of Wisconsin - Madison|
|Abstract:||Data Cubes have been widely studied and implemented, and so we researchers shouldn't be thinking about them anymore, right? Wrong. In this talk, I'll try to convince you that the multidimensional model of data ("cube" sounds so much cooler) provides the right perspective for addressing many challenging tasks, including dealing with imprecision, mining for interesting subsets of data, analysis of historical stream data, and world peace. The talk will touch upon results from a couple of VLDB 2005 papers, and some recent ongoing work.|
|Bio:||Raghu Ramakrishnan is Professor of Computer Sciences at the University of Wisconsin-Madison, and was founder and CTO of QUIQ, a company that pioneered collaborative customer support (acquired by Kanisa). His research is in the area of database systems, with a focus on data retrieval, analysis, and mining. He and his group have developed scalable algorithms for clustering, decision-tree construction, and itemset counting, and were among the first to investigate mining of continuously evolving, stream data. His work on query optimization and deductive databases has found its way into several commercial database systems, and his work on extending SQL to deal with queries over sequences has influenced the design of window functions in SQL:1999.
He is Chair of ACM SIGMOD, on the Board of Directors of ACM SIGKDD and the Board of Trustees of the VLDB Endowment, an associate editor of ACM Transactions on Database Systems, and was previously editor-in-chief of the Journal of Data Mining and Knowledge Discovery and the Database area editor of the Journal of Logic Programming. Dr. Ramakrishnan is a Fellow of the Association for Computing Machinery (ACM), and has received several awards, including a Packard Foundation Fellowship, an NSF Presidential Young Investigator Award, and an ACM SIGMOD Contributions Award. He has authored over 100 technical papers and written the widely-used text "Database Management Systems" (WCB/McGraw-Hill), now in its third edition (with J. Gehrke).
|Title:||Racer - Optimizing in ExpTime and Beyond: Lessons Learnt and Challenges Ahead|
|Speaker:||Volker Haarslev, Concordia University|
In February 2004 the Web Ontology Language (OWL) was adopted by the W3C as a recommendation and emerged as a core standard for knowledge representation on the web. The sublanguage OWL-DL is a notational variant of the well-known description logic SHOIN(Dn-), which has decidable inference problems but is also known to be NExpTime-complete. The availability of OWL-DL generated significant interest in OWL-compliant assertional description logic reasoners. Racer was the first highly optimized assertional reasoner for the very expressive (ExpTime-complete) description logic SHIQ(D-), which covers most parts of OWL-DL with the exception of so-called nominals.
In this talk I will briefly introduce description logics / OWL-DL and associated inference services. Afterward I will discuss the architecture of the description logic reasoner Racer and highlight selected tableau optimization techniques, especially on assertional reasoning and its relationship to database technology. Several recently devised optimization techniques were introduced due to requirements from semantic web applications relating huge amounts of (incomplete) data to ontological information. I will conclude my presentation with an outlook on OWL 1.1 and ongoing and future description logic research such as explanation of reasoning and adding uncertainty as well as database support in Racer Pro.
The research on Racer is joint work with Ralf Moeller, Hamburg University of Technology.
Dr. Haarslev obtained his doctoral degree from the University of Hamburg, Germany, specializing in user interface design. His early research work was in compilers, interfaces and visual languages. His current work is in automated reasoning, especially description logics, which play important roles in database technology and Internet technology. For databases, description logics allow the integration of heterogeneous data sources. For Internet technology, description logics are the logical foundation of the web ontology language (OWL) and form the basis of the semantic web, the emerging next generation of the World Wide Web.
Dr. Haarslev is internationally regarded for his substantial research contributions in the fields of visual language theory and description logics. He is a principal architect of the description logic and OWL reasoner Racer, which can be considered a key component for the emerging semantic web. Dr. Haarslev holds the position of Associate Professor in the Department of Computer Science and Software Engineering at Concordia University. He leads a research group working on automated reasoning and related database technology in the context of the semantic web. Dr. Haarslev is also cofounder of the company Racer Systems, which develops and distributes Racer Pro, the commercial successor of Racer.
|Title:||Entity Resolution in Relational Data (PDF)|
|Speaker:||Lise Getoor, University of Maryland|
|Abstract:||A key challenge for data mining is tackling the problem of mining richly structured datasets, where the objects are linked in some way. Links among the objects may demonstrate certain patterns, which can be helpful for many data mining tasks and are usually hard to capture with traditional statistical models. Recently there has been a surge of interest in this area, fueled largely by interest in web and hypertext mining, but also by interest in mining social networks, security and law enforcement data, bibliographic citations and epidemiological records.
In this talk, I'll begin with a short overview of this newly emerging research area. Then, I will describe some of my group's recent work on link-based classification and entity resolution in relational domains. I'll spend the majority of time describing our work on entity resolution. I'll describe the framework and algorithms that we have developed, present results on several real world datasets and our work on making the algorithms scalable.
Joint work with students: Indrajit Bhattacharya, Mustafa Bilgic, Louis Licamele and Prithviraj Sen.
|Bio:||Prof. Lise Getoor is an assistant professor in the Computer Science Department at the University of Maryland, College Park. She received her PhD from Stanford University in 2001. Her current work includes research on link mining, statistical relational learning and representing uncertainty in structured and semi-structured data. Her work in these areas has been supported by NSF, NGA, KDD, ARL and DARPA. In July 2004, she co-organized the third in a series of successful workshops on statistical relational learning, http://www.cs.umd/srl2004. She has published numerous articles in machine learning, data mining, database and AI forums. She is a member of AAAI Executive council, is on the editorial board of the Machine Learning Journal and JAIR and has served on numerous program committees including AAAI, ICML, IJCAI, KDD, SIGMOD, UAI, VLDB, and WWW.|
|Title:||Nile: Data Streaming in Practice|
|Speaker:||Walid Aref, Purdue University|
Emerging data streaming applications pose new challenges to database management systems. In this talk, I will focus on two applications, namely mobile objects and phenomena detection and tracking applications. I will highlight new challenges that these applications raise and how we address them in the context of Nile, a data stream management system being developed at Purdue. In particular, I will present new features of Nile, including incremental evaluation of continuous queries, supporting "predicate windows" using views, and stream query processing with relevance feedback. I will demonstrate the use and performance gains of these features in the context of the above two applications. Finally, I will talk about ongoing research in Nile and directions for future research.
|Bio:||Walid G. Aref is a professor of computer science at Purdue. His research interests are in developing database technologies for emerging applications, e.g., spatial, spatio-temporal, multimedia, bioinformatics, and sensor databases. He is also interested in indexing, data mining, and geographic information systems (GIS). Professor Aref's research has been supported by the National Science Foundation, Purdue Research Foundation, CERIAS, Panasonic, and Microsoft Corp. In 2001, he received the CAREER Award from the National Science Foundation and in 2004, he received a Purdue University Faculty Scholar award. Professor Aref is a member of Purdue's Discovery Park Bindley Bioscience and Cyber Centers. He is on the editorial board of the VLDB Journal, a senior member of the IEEE, and a member of the ACM.|
|Title:||Data Mining using Fractals and Power Laws (PDF)|
|Speaker:||Christos Faloutsos, CMU|
|Abstract:||What patterns can we find in bursty web traffic? On the web or on the internet graph itself? How about the distributions of galaxies in the sky, or the distribution of a company's customers in geographical space? How long should we expect a nearest-neighbor search to take, when there are 100 attributes per patient or customer record? The traditional assumptions (uniformity, independence, Poisson arrivals, Gaussian distributions) often fail miserably. Should we give up trying to find patterns in such settings?
Self-similarity, fractals and power laws are extremely successful in describing real datasets (coast-lines, river basins, stock prices, brain-surfaces, communication-line noise, to name a few). We show some old and new successes, involving modeling of graph topologies (internet, web and social networks); modeling galaxy and video data; dimensionality reduction; and more.
|Bio:||Christos Faloutsos holds a Ph.D. degree in Computer Science from the University of Toronto, Canada. He is currently a professor at Carnegie Mellon University. He has received the Presidential Young Investigator Award by the National Science Foundation (1989), seven "best paper" awards, and four teaching awards. He has published over 130 refereed articles, one monograph, and holds four patents. His research interests include data mining, fractals, indexing in multimedia and bio-informatics databases, and database performance.|
|Title:||Dynamic Programming for Join Ordering Revisited (PDF)|
|Speaker:||Guido Moerkotte, University of Mannheim|
|Abstract:||Two approaches to derive dynamic programming algorithms for constructing join trees are described in the literature. We show analytically and experimentally that these two variants exhibit vastly diverging runtime behaviors for different query graphs. More specifically, each variant is superior to the other for one kind of query graph (chain or clique), but fails for the other. Moreover, neither of them handles star queries well. This motivates us to derive an algorithm that is superior to the two existing algorithms because it adapts to the search space implied by the query graph.|
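As background for the abstract above, the general idea of building join trees by dynamic programming can be sketched as follows. This is a generic subset-driven enumeration and not the algorithm presented in the talk; the function name `best_join_tree`, the cost model (sum of intermediate result sizes), and the example cardinalities and selectivities are assumptions for illustration only.

```python
from itertools import combinations


def best_join_tree(cards, sel):
    """Cheapest cross-product-free join tree via dynamic programming.

    cards: dict mapping relation name -> cardinality.
    sel:   dict mapping frozenset({a, b}) -> selectivity of the join
           edge between relations a and b (the query graph).
    Returns (cost, cardinality, tree) for the full relation set, or None
    if the query graph is disconnected.
    """
    rels = sorted(cards)
    # best[subset] = (cost, cardinality, tree); base case: single relations.
    best = {frozenset([r]): (0.0, float(cards[r]), r) for r in rels}
    for size in range(2, len(rels) + 1):
        for combo in combinations(rels, size):
            s = frozenset(combo)
            first, rest = combo[0], combo[1:]
            # Pin `first` on the left side so each split is tried once.
            for k in range(len(rest)):
                for extra in combinations(rest, k):
                    left = frozenset((first,) + extra)
                    right = s - left
                    if left not in best or right not in best:
                        continue  # one side has no cross-product-free plan
                    edges = [sel[frozenset((a, b))]
                             for a in left for b in right
                             if frozenset((a, b)) in sel]
                    if not edges:
                        continue  # joining would need a cross product
                    card = best[left][1] * best[right][1]
                    for f in edges:
                        card *= f
                    cost = best[left][0] + best[right][0] + card
                    if s not in best or cost < best[s][0]:
                        best[s] = (cost, card, (best[left][2], best[right][2]))
    return best.get(frozenset(rels))


# Chain query A - B - C with made-up statistics.
plan = best_join_tree(
    {"A": 10, "B": 100, "C": 1000},
    {frozenset(("A", "B")): 0.1, frozenset(("B", "C")): 0.01},
)
```

This sketch enumerates every subset of relations regardless of the query graph's shape, which is the kind of behavior the abstract contrasts with an algorithm that adapts to the search space implied by the query graph.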
|Bio:||From 1981 to 1987 Guido Moerkotte studied computer science at the Universities of Dortmund, Massachusetts, and Karlsruhe. The University of Karlsruhe awarded him a Diploma (1987), a doctorate (1989), and a postdoctoral lecture qualification (1994). In 1994 he became an associate professor at the RWTH Aachen. Since 1996 he has held a full professor position at the University of Mannheim, where he heads the database research group. His research interests include databases and their applications, query optimization, and XML databases. Guido Moerkotte has (co-)authored more than 100 publications and three books.|
|Title:||A System for Data, Uncertainty, and Lineage (PDF)|
|Speaker:||Jennifer Widom, Stanford University|
|Abstract:||Trio is a new type of database system that manages uncertainty and lineage of data as first-class concepts, along with the data itself. Uncertainty and lineage arise in a variety of data-intensive applications, including scientific and sensor data management, data cleaning and integration, and information extraction systems. This talk will survey our recent and current work in the Trio project: the extended-relational "ULDB" model upon which the Trio system is based, Trio's SQL-based query language (TriQL) including formal and operational semantics, a selection of new theoretical challenges and results, Trio's initial prototype implementation, and our planned research directions.
Trio web site: http://www-db.stanford.edu/trio/
|Bio:||Jennifer Widom is a Professor in the Computer Science and Electrical Engineering Departments at Stanford University. She received her Bachelor's degree from the Indiana University School of Music in 1982 and her Computer Science Ph.D. from Cornell University in 1987. She was a Research Staff Member at the IBM Almaden Research Center before joining the Stanford faculty in 1993. Her research interests span many aspects of nontraditional data management. She is an ACM Fellow and a member of the National Academy of Engineering, was a Guggenheim Fellow, and has served on a variety of program committees, advisory boards, and editorial boards.|