Database Seminar Series (2008-2009)

The Database Seminar Series provides a forum for presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks scheduled for 2008-2009 are listed below.

Unless otherwise noted, all talks will be in room DC 1304. Coffee will be served 30 minutes before the talk.

We will try to post presentation notes whenever possible. Please click on the presentation title to access these notes (usually in PDF format).


Database Seminar Series is supported by iAnywhere Solutions, A Sybase Company.


AnHai Doan
J. Stephen Downie
Jerome Simeon
Marianne Winslett
Zack Ives
Kevin Beyer
Mariano Consens
Chris Olston
Boon Thau Loo

15 September 2008, 10:30 AM

Title: Managing Unstructured Data
Speaker: AnHai Doan, University of Wisconsin
Abstract:

This talk argues that our community should seriously consider the problem of managing unstructured data (e.g., text, Web pages, emails, memos, and ads). While many other research communities have also been long laboring on this problem, I argue that we can make unique contributions, by adopting a structure and system focus: (a) make it very easy to extract structures from the raw data, and (b) build end-to-end systems, instead of working on just isolated research problems.

I then describe Cimple, a joint project between Wisconsin, Yahoo, and Microsoft that works toward the above goals. First I describe our current prototype system that employs extraction, integration, and mass collaboration to manage unstructured data. Next, I discuss current efforts in applying the system to a range of applications, including structured Web portals, on-the-fly data integration, personal information management, and ad management. Finally, I discuss lessons learnt from the current Cimple work, and ways forward.

Bio: AnHai Doan is an associate professor in Computer Science at the University of Wisconsin-Madison. His interests cover databases, AI, and the Web. His current research focuses on managing unstructured data, data integration, Web community management, mass collaboration, text management, and information extraction. Selected recent honors include the ACM Doctoral Dissertation Award in 2003, CAREER Award in 2004, and Alfred P. Sloan Research Fellowship in 2007. Selected recent professional activities include co-chairing WebDB at SIGMOD-05 and the AI Nectar track at AAAI-06.

Friday, 17 October 2008, 2:00 PM

Title: The Music Information Retrieval Evaluation eXchange (MIREX): An Introductory Overview
Speaker: J. Stephen Downie, University of Illinois at Urbana-Champaign
Abstract:

The Music Information Retrieval Evaluation eXchange (MIREX) is a community-based formal evaluation framework coordinated and managed by the International Music Information Retrieval Systems Evaluation Laboratory (IMIRSEL) at the University of Illinois at Urbana-Champaign (UIUC). IMIRSEL has been funded by both the National Science Foundation and the Andrew W. Mellon Foundation to create the necessary infrastructure for the scientific evaluation of the many different techniques being employed by researchers interested in the domains of Music Information Retrieval (MIR) and Music Digital Libraries (MDL). This presentation is intended to provide an overview of MIREX, including the unique challenges posed when trying to evaluate specialized music information retrieval systems. This talk will place MIREX in the historical context of the International Conferences on Music Information Retrieval Systems (ISMIR) from which it developed. This lecture will also highlight the speaker's positive and negative experiences in running MIREX as a jumping-off point for outlining the motivation for developing a more robust and sophisticated Networked Environment for Music Analysis (NEMA) architecture for the future sustainability of the MIREX program.

More information about MIREX can be found below:

Downie, J. Stephen (2008). The Music Information Retrieval Evaluation Exchange (2005-2007): A window into music information retrieval research. Acoustical Science and Technology 29 (4): 247-255. Available at: http://dx.doi.org/10.1250/ast.29.247

A recent article in the Philadelphia Inquirer:

Avril, Tom. 2008. Analyzing music the digital way: Computers have exquisite ears. Philadelphia Inquirer (Sept. 22, 2008). Available: http://www.philly.com/inquirer/entertainment/20080922_Computers_have_exquisite_ears.html

Bio: Dr. J. Stephen Downie is an Associate Professor at the Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign (UIUC). He is Director of the International Music Information Retrieval Systems Evaluation Laboratory (IMIRSEL). Professor Downie is Principal Investigator on the Networked Environment for Music Analysis (NEMA) and the Music-to-Knowledge (M2K) music data-mining projects. He has been very active in the establishment of the Music Information Retrieval and Music Digital Library communities through his ongoing work with the ISMIR series of MIR conferences as a member of the ISMIR steering committee.

3 November 2008, 10:30 AM, MC 5158 (Please note different location) 

Title: The Plumber, The Dragon, and the Princess: Growing a Web 2.0 Language from XQuery
Speaker: Jerome Simeon, IBM T.J. Watson Research Center
Abstract:

Data is the bloodline of modern Web applications. Be it for business analysis, feed aggregation, or to build a domain-specific social network, data must flow from sources and services to the users and back. And who knows more about how to make data flow than a database plumber? In the face of the increasing need for agile Web development, new languages (Ruby on Rails, Linq, Links, and XQuery to name a few) have emerged that integrate native data processing capabilities along with more traditional programming language features.

In this talk, I will argue that supporting efficient data processing in the context of a full-fledged programming language raises difficult, and largely unexplored, research questions. I will then present the DXQ project, which extends XQuery (the standard language for querying XML) with modules, imperative features, Web services support, and distribution. I will then illustrate how DXQ can support rapid development of Web applications, using three distinct examples: the DNS (Domain Name System) protocol, the Narada overlay network protocol, and workflow-based processing of RSS feeds. If time permits, I will show how to adapt key existing database optimization techniques to such a language.

This is joint work with Mary Fernandez (AT&T Labs), Giorgio Ghelli (Universita di Pisa), Trevor Jim (AT&T Labs), Kristi Morton (University of Washington), Nicola Onose (UCSD), and Kristoffer Rose (IBM Watson).

Bio: Jerome Simeon is a Researcher for the Scalable XML Infrastructure Group at IBM T.J. Watson. He holds a degree in Engineering from Ecole Polytechnique, and a Ph.D. from Universite d'Orsay. Previously, Jerome worked at INRIA from 1995 to 1999, and Bell Laboratories from 1999 to 2004. His research interests include databases, programming languages, compilers, and semantics, with a focus on Web development. He has put his work into practice in areas ranging from telecommunication infrastructure to music. He is a co-editor for five of the W3C XML Query specifications, and has published more than 50 papers in scientific journals and international conferences. He is also a project lead for the Galax open-source XQuery implementation, and a co-author of "XQuery from the Experts" (Addison Wesley, 2004).

17 November 2008, 10:30 AM

Title: Managing Compliance Data: Addressing the Insider Threat Exemplified by Enron
Speaker: Marianne Winslett, University of Illinois at Urbana-Champaign
Abstract:

Financial misstatements from Fortune 500 companies, dead people who still vote in elections, world-class gymnasts with uncertain birth dates: insiders often have the power and ability to tamper with electronic records to further their own goals. As electronic records supplant paper records, it becomes easy to carry out such manipulations without leaving a paper trail of illicit activities that can be used to track and prosecute offenders after the fact. The US Sarbanes-Oxley Act is perhaps the most (in)famous legislation to target these abuses, but it is just one of many regulations that mandate long-term tamper-free or tamper-evident retention of electronic records, all with the goal of maintaining societal trust in business and government at reasonable cost.

In this talk, we will discuss some of the technical challenges posed by the need for "term-immutable" retention of records. We will describe how industry has responded to these challenges, the security weaknesses in current product offerings, and the role that researchers and government can play in addressing these weaknesses. We will give an overview of research progress to date and describe the major open research problems in this area.

Bio: Marianne Winslett is a research professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign. Winslett's research interests lie in information security and in the management of scientific data. She received a Presidential Young Investigator Award from the National Science Foundation and two best paper awards for research on managing compliance data. She has served on the editorial boards of ACM Transactions on the Web, ACM Transactions on Database Systems, and IEEE Transactions on Knowledge and Data Engineering. She is an ACM Fellow and a former vice-chair of ACM SIGMOD.

Wednesday, 10 December 2008, 10:30 AM

Title: Orchestra: Sharing Inconsistent Data in a Consistent Way
Speaker: Zack Ives, University of Pennsylvania
Abstract:

One of the most pressing needs in business, government, and science is to bring together structured data from a variety of systems, formats, and terminologies. For instance, the emerging field of systems biology seeks to unify biological data to get a big-picture view of the processes within living organisms. Many organizations have set up databases designed to be "clearing houses" for specific types of information: each is separately maintained, cleaned, and curated, and has its own schema and terminology. Updates are constantly made as hypothesized relationships are confirmed or refuted, or new discoveries are made. The different databases contain complementary information that must be integrated to get a complete picture - and each database may have data of different quality or relevance to a domain. However, there is often no consensus on what the definitive answers are - each site may have different beliefs.

The Orchestra project focuses on how to support exchange of data (and updates) among collaborators with evolving databases, in a way that accommodates disagreement, different schemas, and different levels of authority and quality. Orchestra considers collaborators' databases to be logical *peers* into which data can be imported and then locally modified. It allows for a network of *schema mappings* that interrelate peers, annotated with *trust policies* specifying the conditions under which a peer is willing to import data. As a data item is mapped from site to site in the system, its *provenance* is recorded; a peer's trust policies use this provenance (and the values of the data) to assign a score to each incoming data item (based on perceived quality or relevance), and the peer then uses this score to reconcile conflicts and compute a consistent data instance, whose contents may be unique to the peer. The scores assigned to the individual sources can even be *learned* based on user feedback about query answers. The end result is a system that allows each database to selectively diverge from the others as appropriate, but to remain "in sync" in all other cases.
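The scoring and reconciliation step described above can be illustrated with a small sketch. This is not Orchestra's implementation or API: the `Item` class, the `score` and `reconcile` functions, the peer names, and the rule of taking the minimum trust along an item's provenance are all assumptions chosen only to show how two peers with different trust policies can end up with different, internally consistent instances.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """A hypothetical data item with the chain of peers it was mapped through."""
    key: str           # identifies the fact, e.g. a gene name
    value: str         # the claimed value for that fact
    provenance: tuple  # peers the item passed through, origin first


def score(item, trust):
    """Assumed policy: an item is only as trusted as the least trusted
    peer in its provenance; unknown peers get a trust of 0.0."""
    return min((trust.get(peer, 0.0) for peer in item.provenance), default=0.0)


def reconcile(items, trust, threshold=0.0):
    """Keep, for each key, the single highest-scoring value above the threshold."""
    best = {}
    for item in items:
        s = score(item, trust)
        if s <= threshold:
            continue  # this peer refuses to import the item
        if item.key not in best or s > best[item.key][0]:
            best[item.key] = (s, item.value)
    return {key: value for key, (_, value) in best.items()}


incoming = [
    Item("geneX", "kinase", ("lab1",)),
    Item("geneX", "phosphatase", ("lab2",)),
    Item("geneY", "receptor", ("lab2", "lab1")),
]

# Two peers with different (made-up) trust policies over the same sources.
peer_a = reconcile(incoming, {"lab1": 0.9, "lab2": 0.4})
peer_b = reconcile(incoming, {"lab1": 0.2, "lab2": 0.8})
```

With these made-up trust values, the two peers resolve the conflict on `geneX` differently while agreeing on `geneY`, which is the "selectively diverge, otherwise stay in sync" behavior the abstract describes.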

Joint work with Todd J. Green, Grigoris Karvounarakis, Nicholas Taylor, Val Tannen, Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Fernando Pereira, and Sudipto Guha.

Bio: Zachary Ives is an Assistant Professor at the University of Pennsylvania and an Associated Faculty Member of the Penn Center for Bioinformatics. He received his B.S. from Sonoma State University and his PhD from the University of Washington. His research interests include data integration, peer-to-peer models of data sharing, processing and security of heterogeneous sensor streams, and data exchange between autonomous systems. He is a recipient of the NSF CAREER award and a member of the 2006 (first) DARPA Computer Science Study Panel. He has been a co-program chair for the XML Symposium (2006) and New Trends in Information Integration (2008) workshops.

9 February 2009, 10:30 AM

Title: Jaql: Placing Pipes in the Clouds
Speaker: Kevin Beyer, IBM Almaden Research Center
Abstract:

We introduce Jaql, a query language for the JSON data model. JSON (JavaScript Object Notation) is a popular data format for many Web-based applications because of its simplicity and modeling flexibility. JSON easily models a wide spectrum of data, ranging from homogeneous flat data to heterogeneous nested data, and does this in a language-independent format that easily integrates with existing programming languages. We believe that these characteristics make JSON an ideal data format for many Hadoop applications and databases in general. This talk will describe the key features of Jaql and show how it can be used to process JSON data in parallel using Hadoop's map/reduce framework. The talk is intended for a broad computer science audience and includes background on map/reduce and Hadoop.

Bio: Kevin Beyer is a Research Staff Member at the IBM Almaden Research Center. His research interests are in information management, including query languages, analytical processing, and indexing techniques. He has been designing and implementing Jaql, in one form or another, for the past several years. Previously, he led the design and implementation of the XML indexing support in DB2 pureXML.

27 April 2009, 10:30 AM, MC 5158 (Please note different location) 

Title: Achieving and Understanding Linking in the Open Data Cloud
Speaker: Mariano Consens, University of Toronto
Abstract:

The Linking Open Data (LOD) community project is extending the Web by encouraging the creation of interlinks (RDF links between data items identified using dereferenceable URIs). This emerging web of linked data is closely intertwined with the existing web, since data items can be embedded into web documents, and RDF links can reference classic web pages. Abundant linked data justifies extending the capabilities of web browsers and search engines, and enables new usage scenarios, novel applications, and sophisticated mashups. This promising direction for publishing data on the web brings forward a number of challenges. While existing data management techniques can be leveraged to address the challenges, there are unique aspects to managing web scale interlinking.

In this talk, we describe two specific challenges: achieving and managing dense interlinking, and understanding the data and metadata that are used within and across datasets in the open data cloud. Approaches to solve these challenges are showcased in the context of LinkedMDB, the first open linked dataset for movies, and the winner of the Triplification Challenge. LinkedMDB has a large number of interlinks (over a quarter million) to several datasets in the LOD cloud, as well as RDF links to related webpages.

Bio: Mariano Consens' research interests are in the areas of Data Management and the Web, with a focus on linked data, XML searching, analytics for semistructured data, and autonomic systems. He has over 40 publications, including journal publications selected from best conference papers and several patents and patent applications. Mariano received his PhD and MSc degrees in Computer Science from the University of Toronto. He also holds a Computer Systems Engineer degree from the Universidad de la Republica, Uruguay. Consens has been a faculty member in Information Engineering at the MIE Department, University of Toronto, since 2003. Before that, he was research faculty at the School of Computer Science, University of Waterloo, from 1994 to 1999. In addition, he has been active in the software industry as a founder and CTO of a couple of software start-ups, and is currently a Visiting Scientist at the IBM Center for Advanced Studies in Toronto.

11 May 2009, 10:30 AM

Title: Pig: High-Level Dataflow on top of Map-Reduce
Speaker: Chris Olston, Yahoo! Research
Abstract:

Increasingly, organizations capture, transform and analyze enormous data sets. The most prominent examples are Internet companies and e-science. The Map-Reduce scalable dataflow engine has become a popular platform for these applications. Its simple, explicit dataflow programming model is favored by some over the more traditional high-level declarative approach: SQL. On the other hand, the extreme simplicity of Map-Reduce leads to much low-level hacking to deal with the many-step, branching dataflows that arise in practice. Moreover, users must repeatedly code standard operations such as Join by hand. These practices waste time, introduce bugs, harm readability, and impede optimizations.
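To make the point about hand-coded joins concrete, here is a minimal sketch of a reduce-side equi-join written in the map/shuffle/reduce style. It is plain Python that simulates the three phases in memory; it does not use Hadoop or any Map-Reduce library, and the function names and sample records are invented for illustration.

```python
from collections import defaultdict
from itertools import product


def map_phase(records, tag, key_field):
    """Emit (join key, (source tag, record)) so the reducer can tell inputs apart."""
    for rec in records:
        yield rec[key_field], (tag, rec)


def shuffle(pairs):
    """Group all mapped values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(tagged):
    """For one key, pair every left record with every right record."""
    left = [rec for tag, rec in tagged if tag == "L"]
    right = [rec for tag, rec in tagged if tag == "R"]
    for l_rec, r_rec in product(left, right):
        yield {**l_rec, **r_rec}


def mapreduce_join(left, right, key_field):
    """Inner equi-join of two lists of dicts on key_field."""
    pairs = list(map_phase(left, "L", key_field))
    pairs += list(map_phase(right, "R", key_field))
    joined = []
    for tagged in shuffle(pairs).values():
        joined.extend(reduce_join(tagged))
    return joined


users = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}]
visits = [
    {"uid": 1, "url": "a.com"},
    {"uid": 1, "url": "b.com"},
    {"uid": 3, "url": "c.com"},
]
result = mapreduce_join(users, visits, "uid")
```

Even in this toy form, the tagging, grouping, and cross-product logic has to be rewritten for every join, which is the kind of repeated boilerplate a higher-level dataflow language is meant to remove.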

We introduce a new dataflow programming language called Pig Latin, which aims at a sweet spot between SQL and Map-Reduce. It offers SQL-style high-level data manipulation constructs, which can be assembled in an explicit dataflow and interleaved with custom Map- and Reduce-style functions or binaries. Pig Latin also offers nested data structures and other features that appeal to non-SQL programmers.

This talk introduces Pig Latin and its open-source implementation, Pig, which compiles Pig Latin programs into sequences of Hadoop Map-Reduce jobs. In addition to describing the language and philosophy behind Pig, we present some of our research on cross-program optimization and automatically-generated example data sets.

Bio: Christopher Olston is a senior research scientist at Yahoo! Research, working in the areas of data management and web search. Olston is occasionally seen behaving as a professor, and has taught undergrad and grad courses at Berkeley, Carnegie Mellon and Stanford. He received his Ph.D. in 2003 from Stanford under fellowships from the university and the National Science Foundation. His Bachelor's degree is from Berkeley with highest honors. Olston is an avid Cal fan but likes to rollerblade at Stanford.

8 June 2009, 10:30 AM

Title: Declarative Secure Distributed Systems
Speaker: Boon Thau Loo, University of Pennsylvania
Abstract:

In this talk, we present our recent work on using declarative languages to specify, implement, and analyze secure distributed systems. In the first half of the talk, we describe Secure Network Datalog (SeNDlog), a declarative language that unifies declarative networking and logic-based access control languages. SeNDlog enables network routing, distributed systems, and their security policies to be specified and implemented within a common declarative framework. We describe extensions to distributed recursive query processing techniques to execute SeNDlog programs via authenticated communication among untrusted nodes. In the second half of the talk, we introduce the notion of network provenance naturally captured within our declarative framework, and demonstrate its applicability in the areas of network accountability, network forensic analysis and trust management. We further discuss our ongoing work on optimizing distributed query processors in order to process and maintain network provenance efficiently and securely.

To conclude the talk, we will briefly describe other declarative networking research at Penn, in the areas of adaptive wireless networking and formal network verification.

Bio: Boon Thau Loo is an Assistant Professor in the Computer and Information Science department at the University of Pennsylvania. He received his Ph.D. degree in Computer Science from the University of California at Berkeley in 2006. His research focuses on distributed data management systems, Internet-scale query processing, and the application of data-centric techniques and formal methods to the design, analysis and implementation of networked systems. He was awarded the ACM SIGMOD dissertation award (2007) and the NSF CAREER award (2009). He has been the program co-chair for the CoNEXT 2008 Student Workshop and is the current co-chair of the NetDB 2009 workshop co-located with SOSP.