Database Seminar Series (2011-2012)

The Database Seminar Series provides a forum for presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks that are scheduled for 2011-2012 are below.

The talks are usually held on a Wednesday at 2:30 pm in room DC (Davis Centre) 1302. Coffee will be served 30 minutes before the talk.

We will try to post presentation notes whenever possible. Please click on a presentation title to access the notes (usually in PDF format).


The Database Seminar Series is supported by Sybase iAnywhere.


Jonathan Goldstein
Alon Halevy
Molham Aref
Ryan Johnson
Martin Kersten

28 September 2011, 2:30 pm, DC 1302

Title: Temporal Analytics on Big Data for Web Advertising
Speaker: Jonathan Goldstein, Microsoft
Abstract: Work by Badrish Chandramouli, Jonathan Goldstein, and Songyun Duan

"Big Data" stored in map-reduce (M-R) clusters is often fundamentally temporal in nature, as are many analytics tasks over such data. For instance, display advertising uses Behavioral Targeting (BT) to select ads for users based on prior searches, page views, etc. Prior work on BT has focused on techniques that scale well for offline data using M-R. However, this approach has limitations for BT-style applications that deal with temporal data: (1) many queries are temporal and not easily expressible in M-R, and moreover, the set-oriented nature of M-R front-ends such as SCOPE is not suitable for temporal processing; (2) as commercial systems mature, they may need to also directly analyze and react to real-time data feeds since a high turnaround time can result in missed opportunities, but it is difficult for current solutions to naturally also operate over real-time streams.

We therefore propose a novel framework called TiMR (pronounced timer) that combines a time-oriented data processing system with an M-R framework. Users write and submit analysis algorithms as temporal queries; these queries are succinct, scale-out-agnostic, and easy to write. They scale well on large-scale offline data using TiMR, and can work unmodified over real-time streams. We also propose new cost-based query fragmentation and temporal partitioning schemes for improving efficiency with TiMR. We then show the feasibility of this approach for BT, with new temporal algorithms that exploit new targeting opportunities. Experiments using real data from a commercial ad platform show that TiMR is very efficient and incurs orders-of-magnitude lower development effort. Our BT solution is easy and succinct, and performs up to several times better than current schemes in terms of memory, learning time, and click-through-rate/coverage.
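
To give a concrete flavor of such a temporal query, here is a small illustrative sketch in Python. It is not code from the paper: the six-hour window, the one-hour hop, and the event schema are all invented for illustration. The point is that the logic is expressed over event timestamps rather than over a particular storage layout, so the same computation could in principle run over an offline log (partitioned by user in M-R) or over a live stream.

    from collections import Counter, defaultdict

    WINDOW = 6 * 3600   # hypothetical window width: 6 hours, in seconds
    HOP = 3600          # hypothetical hop size: 1 hour

    def hopping_window_counts(events):
        """Count keywords per (user, window) over timestamped events.

        events: iterable of (timestamp, user_id, keyword) tuples.
        Window w covers timestamps in [w * HOP, w * HOP + WINDOW).
        """
        counts = defaultdict(Counter)
        for ts, user, keyword in events:
            # An event contributes to every hopping window containing it.
            first = (ts - WINDOW) // HOP + 1
            last = ts // HOP
            for w in range(max(first, 0), last + 1):
                counts[(user, w)][keyword] += 1
        return counts

    log = [(100, "u1", "cars"), (3700, "u1", "cars"), (3900, "u2", "travel")]
    for (user, w), kws in sorted(hopping_window_counts(log).items()):
        print(user, w, dict(kws))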

Bio: Jonathan Goldstein currently does R&D for Microsoft StreamInsight, a streaming product based on the CEDR algebra and query processing algorithms developed in the CEDR research project at Microsoft Research, which he led for the four years prior to joining the product team. Prior to working on streaming, Jonathan worked on query optimization, audio fingerprinting, similarity search, and database compression. His work on similarity search has been recognized in the SIGMOD influential papers anthology, and he was recently awarded the SIGMOD Test of Time award for his prior query optimization work. Jonathan received his B.S. from SUNY Stony Brook in 1993 and his Ph.D. from the University of Wisconsin in 1999.

12 October 2011, 2:00 pm, DC 1302 (Please note the changed start time)

Title: Bringing (Web) Databases to the Masses
Speaker: Alon Halevy, Google Research
Abstract: The World-Wide Web contains vast quantities of structured data on a variety of domains, such as hobbies, products, and reference data. Moreover, the Web provides a platform that can encourage publishing more data sets from governments and other public organizations, and can support new data management opportunities such as effective crisis response, data journalism, and crowd-sourcing data sets. To enable such widespread dissemination and use of structured data on the Web, we need to create an ecosystem that makes it easier for users to discover, manage, visualize, and publish structured data on the Web.

I will describe some of the efforts we are conducting at Google towards this goal and the technical challenges they raise. In particular, I will describe Google Fusion Tables, a service that makes it easy for users to contribute data and visualizations to the Web, and the WebTables project, which attempts to discover high-quality tables on the Web and provide effective search over the resulting collection of 200 million tables.

Bio: Alon Halevy heads the Structured Data Management Research group at Google. Prior to that, he was a professor of Computer Science at the University of Washington in Seattle, where he founded the database group. In 1999, Dr. Halevy co-founded Nimble Technology, one of the first companies in the Enterprise Information Integration space, and in 2004 he founded Transformic Inc., a company that created search engines for the deep web and was acquired by Google. Dr. Halevy is a Fellow of the Association for Computing Machinery, received the Presidential Early Career Award for Scientists and Engineers (PECASE) in 2000, and was a Sloan Fellow (1999-2000). He received his Ph.D. in Computer Science from Stanford University in 1993. He is also a bit of a coffee nut.

30 November 2011, 2:30 pm, DC 1302

Title: Datalog for Enterprise Software: from Industrial Applications to Research
Speaker: Molham Aref, LogicBlox
Abstract: LogicBlox is a platform for the rapid development of enterprise applications in the domains of decision automation, analytics, and planning. Although the LogicBlox platform embodies several components and technology decisions (e.g., an emphasis on software-as-a-service), the key substrate and glue is an implementation of the Datalog language. All application development on the LogicBlox platform is done declaratively in Datalog: the language is used not only to query large data sets, but also to develop web and desktop GUIs (with the help of pre-defined libraries), to interface with solvers, statistics tools, and optimizers for complex analytics solutions, and to express the overall business logic of the application. The goal of this talk is to present both the business case for Datalog and the fruitful interaction of research and industrial applications in the LogicBlox context.
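
As a minimal illustration of the execution model behind such rules, the Python sketch below evaluates a recursive Datalog rule bottom-up, iterating to a fixpoint. This is not LogicBlox's actual dialect or evaluation engine, and the relations ("edge", "reach") are invented for illustration.

    def transitive_closure(edges):
        """Evaluate the Datalog program
               reach(x, y) <- edge(x, y).
               reach(x, z) <- reach(x, y), edge(y, z).
        by naive bottom-up iteration to a fixpoint."""
        reach = set(edges)
        while True:
            # Join reach with edge to derive new reach facts.
            new = {(x, z) for (x, y) in reach for (y2, z) in edges if y == y2}
            if new <= reach:      # fixpoint: nothing new derivable
                return reach
            reach |= new

    # Example facts: a small part hierarchy.
    edge = {("engine", "car"), ("piston", "engine"), ("wheel", "car")}
    print(sorted(transitive_closure(edge)))  # includes ("piston", "car")
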
Bio: Molham has spent the last 20 years developing, bringing to market, and implementing enterprise-grade analytic, predictive, optimization, and simulation solutions for the demand chain, supply chain, and revenue management across various industries, including retail, wireless, and financial services. Prior to LogicBlox, Molham spent nine years at HNC Software and then at its subsidiary Retek, where his primary responsibility was to lead the development, sales, and implementation of various analytical and predictive solutions for supply chain and revenue management. Molham started his career as a software engineer and scientist at AT&T, primarily in the area of image understanding and computer vision. Molham holds a bachelor's degree in computer engineering, a master's degree in electrical engineering, and a master's degree in computer science, all from the Georgia Institute of Technology.

25 January 2012, 2:30 pm, DC 1302

Title: Communication and co-design for scalable database engines
Speaker: Ryan Johnson, University of Toronto
Abstract: Multicore hardware poses an ongoing challenge for software, which must extract ever-increasing degrees of parallelism in order to benefit from Moore's Law. Although database engines benefit from highly concurrent workloads, bottlenecks within the DBMS itself conspire to limit scalability. Trends such as NoSQL and shared-nothing databases (sometimes hosted within a single machine!) are partly driven by the assumption that communication -- whether in the form of scheduling, lock conflicts, or coherence traffic -- poses a fundamental threat to scalability and must be categorically eliminated.

This talk observes that several types of communication patterns exist, not all of them harmful, and that unscalable communication patterns can often be converted to more helpful ones. Doing so can eliminate a wide variety of scalability bottlenecks, from the query scheduler all the way down to the operating system, while still allowing the convenience of a shared-everything system. Along the way, we uncover a second recurring theme: effective solutions increasingly tend towards co-design. The DBMS is a highly complex and layered piece of software, and changes (often fairly small ones) in one part of the system can allow disproportionate simplification and/or enhancement of other parts. Together, these two observations allow a shared-everything DBMS to scale to dozens of cores without giving up features such as durability or strong consistency.

Bio: Ryan Johnson is an Assistant Professor at the University of Toronto specializing in systems aspects of database engines. His pursuit of scalable database engines spans the database software stack and touches on operating systems and computer architecture. With a background in computer engineering and embedded systems, he believes that big-O tells only half the story and that practical problem-solving at the systems level incorporates a wide variety of tools and techniques.

Monday, 9 April 2012, 2:30 pm, DC 1302 (Not on a Wednesday)

Title: Arrays in database systems, the next frontier?
Speaker: Martin Kersten, CWI
Abstract: Scientific applications are still poorly served by contemporary relational database systems. At best, the system provides a bridge to an external library using user-defined functions, explicit import/export facilities, or linked-in Java/C# interpreters. The time has come to rectify this with SciQL, an SQL-based query language for science applications with arrays as first-class citizens.

In this talk I outline the language features using examples from seismology, astronomy, and remote sensing. These examples demonstrate that a symbiosis of the relational and array paradigms is feasible and highly effective for coping with the data-intensive research challenges encountered in science.
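
As a rough analogue of what such an array query expresses, the Python/NumPy sketch below applies a structural 3x3 neighborhood average to a two-dimensional array and then makes a relational-style selection over the result. This is not SciQL syntax, and the raster data and threshold are invented for illustration; SciQL's contribution is expressing this kind of computation declaratively inside SQL.

    import numpy as np

    def neighborhood_mean(img):
        """Mean over each cell's 3x3 neighborhood (edges padded by replication)."""
        padded = np.pad(img, 1, mode="edge")
        windows = np.lib.stride_tricks.sliding_window_view(padded, (3, 3))
        return windows.mean(axis=(2, 3))

    rng = np.random.default_rng(0)
    image = rng.random((6, 6))          # a small remote-sensing-style raster
    smoothed = neighborhood_mean(image)
    # Relational-style selection: which cells exceed the threshold?
    xs, ys = np.nonzero(smoothed > 0.6)
    print(list(zip(xs.tolist(), ys.tolist())))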

I then describe its ongoing implementation on top of MonetDB at some length, providing a vista on the many novel database research issues that emerge.

Bio: Martin Kersten devoted most of his scientific career to the development of database systems. The latest incarnation is the open-source system MonetDB (see http://www.monetdb.org), which demonstrates the viability of the column-store approach. The system is developed by the Database Architectures group at CWI, which he established in 1985 and which hosts a strong group of experimental scientists. Kersten is a CWI research fellow and a full professor at the University of Amsterdam. He is a (co-)author of more than 140 papers and a recipient of multiple large (inter)national research grants to steer multimedia and scientific database research. He is a member emeritus of the VLDB Endowment.