Database Seminar Series (2010-2011) | Data Systems Group

The Database Seminar Series provides a forum for presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks that are scheduled for 2010-2011 are below.

The talks are usually held on a Wednesday at 2:30 pm. Unless otherwise noted, all talks will be in room DC (Davis Centre) 1302. Coffee will be served 30 minutes before the talk.

We will try to post the presentation notes whenever possible. Click on a presentation title to access the notes (usually in PDF format).

The Database Seminar Series is supported by Sybase iAnywhere.

Kelly Lyons

Alkis Polyzotis

Goetz Graefe

Bettina Kemme

Daniel Abadi

Divyakant Agrawal

Shivnath Babu

Renee Miller

Avigdor Gal

Andreas Thor

29 September 2010, 3:00 pm (Please note change in starting time)

Title:	Mediating Human-to-Human Interactions through Social Media Technology (PDF)
Speaker:	Kelly Lyons, University of Toronto
Abstract:	With the increase in globalization, an increasing number of companies conduct business across distance. Companies are distributed across geographic boundaries and time zones, and individual knowledge workers are working from home or otherwise telecommuting. Companies are choosing to save money by reducing the number of meetings involving travel and are further motivated by a desire to reduce their carbon footprints. This means that many aspects of business engagement between people (including research collaborations, decision making, and even software development) are taking place over distance supported by some combination of technologies including teleconferences, video conferences, electronic meeting software, and various collaborative platforms. In this talk, we present current research projects looking into how social media technology can support human interactions over distance.
Bio:	Kelly Lyons is an Associate Professor in the Faculty of Information at the University of Toronto. Prior to joining the Faculty of Information, she was the Program Director of the IBM Toronto Lab Centre for Advanced Studies (CAS). Her current research interests include service science, social computing, and collaboration. Currently, she is focusing on technologies, work practices, and business models that support and mediate human-to-human interactions in service systems. Kelly holds a cross-appointment with the Department of Computer Science at the University of Toronto, is a member of the University of Toronto's Knowledge Media Design Institute, is an IBM Faculty Fellow, Member-at-Large of the ACM Council, and a member of the ACM-W Executive Committee. More details can be found on her webpage at: http://individual.utoronto.ca/klyons

Thursday, 14 October 2010, 2:30 pm, DC 1304 (Please note change in day and room)

Title:	Semi-Automatic Index Tuning for Database Systems
Speaker:	Alkis Polyzotis, University of California, Santa Cruz
Abstract:	Database systems rely heavily on indexes in order to achieve good performance. Selecting the appropriate indexes is a difficult optimization problem, and modern database systems are equipped with automated methods that recommend indexes based on some type of workload analysis. Unfortunately, current methods either require advanced knowledge of the database workload, or force the administrator to relinquish control of which indices are created. This talk will summarize our recent work in semi-automatic index tuning, a novel index recommendation technique that addresses the shortcomings of previous methods. Semi-automatic tuning leverages techniques from online optimization, which allows us to prove strong bounds on the quality of its recommendations. The experimental results show that semi-automatic tuning outperforms previous methods by a large margin, offering index recommendations that achieve close to optimal savings in workload evaluation time.
Bio:	Neoklis Polyzotis is currently an associate professor at UC Santa Cruz. His research focuses on database systems, and in particular on on-line database tuning, scientific data management, and cloud computing. He received an NSF CAREER award in 2004 and IBM Faculty Awards in 2005 and 2006. He was also runner-up for best paper at VLDB 2007 and received the best newcomer paper award at PODS 2008. He received his PhD from the University of Wisconsin at Madison in 2003.

1 December 2010, 2:30 pm, MC 5136 (Please note change in room)

Title:	A New Join Algorithm
Speaker:	Goetz Graefe, HP Labs
Abstract:	Database query processing traditionally relies on three alternative join algorithms: index nested loops join exploits an index on its inner input, merge join exploits sorted inputs, and hash join exploits differences in the sizes of the join inputs. Cost-based query optimization chooses the most appropriate algorithm for each query and for each operation. Unfortunately, mistaken algorithm choices during compile-time query optimization are common yet expensive to investigate and to resolve. Our goal is to end mistaken choices among join algorithms by replacing the three traditional join algorithms with a single one. Like merge join, this new join algorithm exploits sorted inputs. Like hash join, it exploits different input sizes for unsorted inputs. In fact, for unsorted inputs, the cost functions for recursive hash join and for hybrid hash join have guided our search for the new join algorithm. In consequence, the new join algorithm can replace both merge join and hash join in a database management system. The in-memory components of the new join algorithm employ indexes. If the database contains indexes for one (or both) of the inputs, the new join can exploit persistent indexes instead of temporary in-memory indexes. Using database indexes to match input records, the new join algorithm can also replace index nested loops join.
Bio:	Goetz Graefe is a member of the Information Analytics Lab within Hewlett-Packard Laboratories. His experience and expertise are focused on database management systems, gained in academic research, industrial consulting, and industrial product development. Goetz's areas of expertise within database management systems cover compile-time query optimization including extensible query optimization, run-time query execution including parallel query execution, indexing, and transactions. He has also worked on transactional memory, specifically techniques for software implementations of transactional memory. Goetz pursued undergraduate studies in business and in computer science at multiple German universities. In 1983, he was admitted to the University of Wisconsin - Madison, where he was granted an MS degree in 1984 and a Ph.D. in 1987.

23 March 2011, 2:30 pm, MC 6005 (Please note change in room)

Title:	Data Consistency in Scalable Multi-tier Architectures
Speaker:	Bettina Kemme, McGill University
Abstract:	Most transactional e-commerce applications are implemented in multi-tier architectures where the application tier implements business logic and the database tier maintains persistent data. Each of the tiers might be replicated for performance. The application tier usually coordinates transaction execution across tiers, and caches frequently accessed data. Often, it has its own concurrency control mechanism that provides various degrees of isolation offering a trade-off between consistency and performance. While developers are aware that choosing lower levels of isolation might lead to inconsistencies, there is often no understanding of how often they occur. Furthermore, inconsistencies are often detected very late, when reconciliation becomes an expensive task. Finally, although a multi-component system might claim to offer a certain level of isolation, it might actually fail to do so, as distribution aspects are often not taken into account. In this talk, I will present approaches that address these issues. I will present solutions that automatically detect, quantify and categorize consistency anomalies during run-time of multi-tier applications. These approaches do not need to know anything about the applications themselves, and are fully implemented in the application server tier, in our case, JEE-compliant servers. I will also discuss application server distribution strategies for various levels of isolation.
Bio:	Bettina Kemme is an Associate Professor at the School of Computer Science of McGill University, Montreal, where she leads the distributed information systems lab. She holds degrees in Computer Science from ETH Zurich (PhD) and Friedrich-Alexander-Universitaet Erlangen, Germany (Inf.-Diplom). She was a recipient of the VLDB 10-year paper award in 2010. Her research focuses on large-scale data management with a main focus on data consistency.

6 April 2011, 2:30 pm

Title:	Scalable Database Systems for a Machine-Dominated World
Speaker:	Daniel Abadi, Yale University
Abstract:	As machines slowly replace humans as the primary source of data generation and transaction initiators, we enter a new era where data generation and transaction processing increase at the speed of Moore's law, permanently creating a need for scalable data management systems. In this talk, I will present the architecture of two scalable data management systems we have built in my group at Yale. The first is a scalable system optimized for data analysis called HadoopDB that attempts to combine the scalability of batch-processing systems such as Hadoop with the interactive performance of parallel database systems. The talk will overview the ideas from the initial paper on HadoopDB, and then will discuss some recent developments. The second system is designed for scalable transactional processing, with a particular focus on the hard problem of achieving high throughput in non-partitionable workloads. The basic idea is to replace the concurrency control component of database systems with a deterministic protocol, and use multiple such systems as building blocks for scalable transaction processing. Such an approach enables low-cost consistent replication while improving transactional throughput by eliminating two-phase commit. I will present the basic architecture of the system in addition to some promising early results on transactional processing benchmarks (i.e., TPC-C).
Bio:	Daniel Abadi is an assistant professor of computer science at Yale University. Before joining the Yale faculty three and a half years ago, he spent four years at the Massachusetts Institute of Technology where he received his Ph.D. Abadi has been a recipient of a Churchill Scholarship, an NSF CAREER Award, the 2008 SIGMOD Jim Gray Doctoral Dissertation Award, and the 2007 VLDB best paper award. He tweets at @daniel_abadi.

4 May 2011, 2:30 pm

Title:	Elastic Scalability of Data-intensive Applications in the Cloud
Speaker:	Divyakant Agrawal, University of California, Santa Barbara
Abstract:	Over the past two decades, database and systems researchers have made significant advances in the development of algorithms and techniques to provide data management solutions that carefully balance the three major requirements when dealing with critical data: high availability, reliability, and data consistency. However, over the past few years the data requirements, in terms of data availability and system scalability, from Internet-scale enterprises that provide services and cater to millions of users have been unprecedented. Cloud computing has emerged as an extremely successful paradigm for deploying Internet and Web-based applications. Scalability, elasticity, pay-per-use pricing, and autonomic control of large-scale operations are the major reasons for success and widespread adoption of cloud infrastructures. Current proposed solutions to scalable data management, driven primarily by prevalent application requirements, significantly downplay the data consistency requirements and instead focus on high scalability and resource elasticity to support data-rich applications for millions to tens of millions of users. In particular, the "newer" data management systems limit consistent access only at the granularity of single objects, rows, or keys, thereby significantly trading off consistency in order to achieve very high scalability and availability. But the growing popularity of "cloud computing", the resulting shift of a large number of Internet applications to the cloud, and the quest towards providing data management services in the cloud have opened up the challenge of designing data management systems that provide consistency guarantees at a granularity which goes beyond single rows and keys. In this talk, we analyze the design choices that allowed modern scalable data management systems to achieve orders of magnitude higher levels of scalability compared to traditional databases. With this understanding, we highlight some design principles for data management systems that can be used to augment existing databases with new cloud features such as scalability, elasticity, and autonomy. We also present recent advances that have been made to strike a middle ground between the two radically different data management architectures: traditional database management systems where the data is treated as a "whole" versus modern key-value stores where data is treated as a collection of independent "granules".
Bio:	Dr. Divyakant Agrawal is a Professor of Computer Science at the University of California at Santa Barbara. His research expertise is in the areas of database systems, distributed computing, data warehousing, and large-scale information systems. Dr. Agrawal served as Chair of the Computer Science Department at UCSB from 1999 to 2003. From January 2006 through December 2007, Dr. Agrawal served as VP of Data Solutions and Advertising Systems at the Internet Search Company ASK.com. Dr. Agrawal has also served as a Visiting Senior Research Scientist at the NEC Laboratories of America in Cupertino, CA from 1997 to 2009. During his professional career, Dr. Agrawal has served on numerous Program Committees of International Conferences, Symposia, and Workshops and served as an editor of the journal of Distributed and Parallel Databases (1993-2008), the VLDB journal (2003-2008) and currently serves on the editorial boards of the Proceedings of the VLDB and ACM Transactions on Database Systems. He recently served as the Program Chair of the 2010 ACM International Conference on Management of Data and served as the General Chair of the 2010 ACM SIGSPATIAL Conference on Advances in Geographical Information Systems. Dr. Agrawal organized an NSF Workshop on the Science of Cloud Computing in March 2011, is serving as the General Co-Chair of ACM SIGSPATIAL Conference on Advances in GIS (ACM GIS 2011), and is serving as the Program Co-Chair of ACM Workshop on Large Scale Distributed Systems and Middleware (ACM LADIS 2011). Dr. Agrawal's research philosophy is to develop data management solutions that are theoretically sound and are relevant in practice. He has published 300+ research manuscripts in prestigious forums (journals, conferences, symposia, and workshops) on a wide range of topics related to data management and distributed systems and has advised more than 30 Doctoral students during his academic career. Recently, Dr. Agrawal has been recognized as an Association for Computing Machinery (ACM) Distinguished Scientist. His current interests are in the area of scalable data management and data analysis in Cloud Computing environments, security and privacy of data in the cloud, and scalable analytics over social networks data and social media.

Monday, 27 June 2011, 10:30 am (Please note change in day and starting time)

Title:	MADDER and Self-Tuning Data Analytics on Hadoop with Starfish (PDF)
Speaker:	Shivnath Babu, Duke University
Abstract:	Timely and cost-effective analytics over "big data" is now a key ingredient for success in many businesses, scientific and engineering disciplines, and government endeavors. The Hadoop software stack, which consists of an extensible MapReduce execution engine, pluggable distributed storage engines, and a range of procedural to declarative interfaces, is a popular choice for big data analytics. Most practitioners of big data analytics (such as computational scientists, systems researchers, and business analysts) lack the expertise to tune the system to get good performance. Unfortunately, Hadoop's performance out of the box leaves much to be desired, leading to suboptimal use of resources, time, and money (in pay-as-you-go clouds). We introduce Starfish, a self-tuning system for big data analytics. Starfish builds on Hadoop while adapting to user needs and system workloads to provide good performance automatically, without any need for users to understand and manipulate the many tuning knobs in Hadoop. While Starfish's system architecture is guided by work on self-tuning database systems, we discuss how new analysis practices (dubbed the MADDER principles) over big data pose new challenges, leading us to different design choices in Starfish.
Bio:	Shivnath Babu is an Assistant Professor of Computer Science at Duke University. He got his Ph.D. from Stanford University in 2005. He has received a U.S. National Science Foundation CAREER Award and three IBM Faculty Awards. His research interests include making data-intensive computing systems easier to manage, automated cluster sizing and problem diagnosis for systems running on cloud platforms, as well as automated detection and recovery from data corruption caused by hardware faults, software bugs, or human mistakes.

29 June 2011, 2:30 pm

Title:	On Schema Discovery
Speaker:	Renee Miller, University of Toronto
Abstract:	Structured data is distinguished from unstructured data by the presence of a schema describing the logical structure and semantics of the data. The schema is the means through which we understand and query the underlying data. Schemas enable data independence. In this talk, I consider a few problems related to the discovery and maintenance of schemas. I'll discuss the changing role of schemas from prescriptive to descriptive and new applications of schemas in data curation and data quality. This talk is based on joint work with Fei Chiang, Periklis Andritsos, and Oktie Hassanzadeh.
Bio:	Renée J. Miller is a professor of computer science and the Bell Canada Chair of Information Systems at the University of Toronto. She received the US Presidential Early Career Award for Scientists and Engineers (PECASE), the highest honor bestowed by the United States government on outstanding scientists and engineers beginning their careers. She received an NSF CAREER Award, the Premier's Research Excellence Award, and an IBM Faculty Award. She is a fellow of the ACM. Her research interests are in the efficient, effective use of large volumes of complex, heterogeneous data. This interest spans data integration and exchange, inconsistent and uncertain data management, and data curation and cleaning. She serves on the Board of Trustees of the VLDB Endowment and was elected to serve as VLDB President beginning January 2010. She is also serving as the PC Chair for SIGMOD 2011. She leads a Canada-wide Strategic Research Network on Business Intelligence. She received her PhD in Computer Science from the University of Wisconsin, Madison and bachelor's degrees in Mathematics and Cognitive Science from MIT.

6 July 2011, 2:30 pm, MC 5136 (Please note change in room)

Title:	Uncertain Schema Matching: the Power of not Knowing
Speaker:	Avigdor Gal, Technion
Abstract:	Schema matching is the task of providing correspondences between concepts describing the meaning of data in various heterogeneous, distributed data sources. Schema matching is one of the basic operations required by the process of data and schema integration, and thus has a great effect on its outcomes, whether these involve targeted content delivery, view integration, database integration, query rewriting over heterogeneous sources, duplicate data elimination, or automatic streamlining of workflow activities that involve heterogeneous data sources. Although schema matching research has been ongoing for over 25 years, only recently has a realization emerged that schema matchers are inherently uncertain. Since 2003, work on the uncertainty in schema matching has picked up, along with research on uncertainty in other areas of data management. This lecture presents the benefits of modelling schema matching as an uncertain process and shows a single unified framework for it. We also briefly cover two common methods that have been proposed to deal with uncertainty in schema matching, namely ensembles and top-K matchings, and discuss the applicability of this research to NisB, a European project offering a toolkit for enterprise integration. The talk is based on a recent manuscript, part of the Synthesis Lectures on Data Management by Morgan & Claypool.
Bio:	Avigdor Gal is an Associate Professor at the Faculty of Industrial Engineering & Management at the Technion - Israel Institute of Technology. He received his D.Sc. degree from the Technion in 1995 in the area of temporal active databases. He has published more than 95 papers in journals (e.g. Journal of the ACM (JACM), ACM Transactions on Database Systems (TODS), IEEE Transactions on Knowledge and Data Engineering (TKDE), ACM Transactions on Internet Technology (TOIT), and the VLDB Journal), books (Schema Matching and Mapping) and conferences (ICDE, ER, CoopIS, BPM) on the topics of data integration, temporal databases, information systems architectures, and active databases. Avigdor is a member of CoopIS (Cooperative Information Systems) Advisory Board, a member of IFIP WG 2.6, and a recipient of the IBM Faculty Award for 2002-2004. He is a member of the ACM and a senior member of IEEE. Avigdor served as a Program co-Chair and General Chair of CoopIS and DEBS, and in various roles in ER and CIKM. He served as a program committee member in SIGMOD, VLDB, ICDE and others. Avigdor is an Area Editor of the Encyclopedia of Database Systems.

13 July 2011, 2:30 pm, MC 2018B (Please note change in room)

Title:	Data Integration in the Cloud
Speaker:	Andreas Thor, University of Maryland
Abstract:	Cloud computing has become a popular paradigm for efficiently processing computationally and data-intensive tasks. Such tasks can be executed on demand on powerful distributed hardware and service infrastructures. The parallel execution of complex tasks is facilitated by different programming models (e.g., MapReduce), distributed data stores, and the ability to employ computing capacity on demand. Data integration can notably benefit from cloud computing because accessing multiple data sources and integration of instance data are usually expensive tasks. In the first part of the talk we introduce CloudFuice, a data integration system that follows a mashup-like specification of advanced data flows for data integration. CloudFuice's task-based execution approach allows for an efficient, asynchronous, and parallel execution of data flows in the cloud and utilizes recent cloud-based web engineering instruments. The second part of the talk deals with the effectiveness and scalability of MapReduce-based implementations for entity resolution. In the presence of skewed data, sophisticated redistribution approaches become necessary to achieve load balancing among all reduce tasks to be executed in parallel. The proposed approaches support blocking techniques to reduce the search space of entity resolution and effectively distribute the entities of large blocks among multiple reduce tasks.
Bio:	Andreas Thor (http://dbs.uni-leipzig.de/de/person/andreas_thor) received a Diploma and a Ph.D. in Computer Science in 2002 and 2008, respectively, from the University of Leipzig, Germany. He holds an appointment as Research Scientist with the database group in Leipzig. Andreas is currently a visiting research scientist at University of Maryland Institute for Advanced Computer Studies. Andreas' research areas deal with integration of web data sources. More specifically, he has been working on approaches for entity resolution, ontology alignment, and flexible integration architectures.