Database Seminar Series (2012-2013)

The Database Seminar Series provides a forum for presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks that are scheduled for 2012-2013 are below.

The talks are usually held on a Wednesday at 2:30 pm in room DC (Davis Centre) 1302. Coffee will be served 30 minutes before the talk.

We will try to post the presentation notes whenever possible. Please click on the presentation title to access these notes.


The Database Seminar Series is supported by Sybase iAnywhere.


Val Tannen
Gokul Soundararajan
Shivakumar Vaithyanathan
Arthur Zimek
Mohamed Mokbel
Michael Bender
Franz Faerber
Nick Koudas

19 September 2012, 2:30 pm, DC 1302

Title: Provisioning for What-If Analysis
Speaker: Val Tannen, University of Pennsylvania
Abstract:

Problems of what-if analysis (such as hypothetical deletions, insertions, and modifications) over complex analysis queries are increasingly commonplace, e.g., in forming a business strategy or looking for causal relationships in science. Here, data analysts are typically interested in only task-specific views of the data, and they expect to be able to interactively manipulate the data in a natural and seamless way - possibly on a phone or tablet, and possibly via a spreadsheet or similar interface without having to carry the full machinery of a DBMS.

The Caravan system enables what-if analysis, i.e., fast, lightweight, interactive exploration of alternative answers, within views computed over large-scale, possibly distributed data sources. Our novel approach is based on creating dedicated provisioned autonomous representations, or PARs. PARs are compiled out of the data, initial analysis queries and user-specified what-if scenarios and are based on c-tables and their recent generalization to aggregation. They allow rapid evaluation of what-if scenarios without accessing the original data or performing complex query operations. Importantly, the size of PARs is governed by the parameters of the what-if analysis and is proportional to the size of the initial query answer rather than the typically much larger source data. Consequently, many what-if analysis tasks performed through PAR evaluations can be done autonomously, on limited-resource devices. We describe our model and architecture, demonstrate preliminary performance results, and present several open implementation and optimization issues.

Joint work with Daniel Deutch (Ben-Gurion U), Zack Ives (UPenn), and Tova Milo (Tel Aviv U).

Bio: TBD

10 October 2012, 2:30 pm, DC 1331 (Not the usual room)

Title: MixApart: Decoupled Analytics for Shared Storage Systems
Speaker: Gokul Soundararajan, NetApp
Abstract: Data analytics and enterprise applications have very different storage functionality requirements. For this reason, enterprise deployments of data analytics are on a separate storage silo. This generates additional costs and inefficiencies in data management, e.g., whenever data needs to be archived, copied, or migrated across silos. We design MixApart, a scalable data processing framework for shared enterprise storage systems. With MixApart, a single consolidated storage back-end manages enterprise data and services all types of workloads, thus simplifying data management and lowering hardware costs for enterprises. In addition, MixApart enables the local storage performance required by data analytics through an integrated data caching and scheduling solution. We expect that our decoupled, stateless cache design will be most useful for cross-data center deployments and for transparent and consistent refresh of analytics data upon updates to underlying enterprise data. We evaluate MixApart using a 100-core Amazon EC2 cluster with micro-benchmarks and production workload traces. Our evaluation shows that MixApart provides i) up to 28% faster performance than the traditional ingest-then-compute workflows used in enterprise IT analytics, and ii) comparable performance to an ideal Hadoop setup without data ingest, at similar cluster sizes.
Bio: Gokul Soundararajan is a member of technical staff at NetApp. He currently works on techniques to improve the efficiency and management of dynamic data centers. Prior to joining NetApp, in 2010, Gokul received his PhD degree from the University of Toronto in Electrical and Computer Engineering. For his dissertation, Gokul developed novel techniques to improve performance and the manageability of multi-tenant shared data centers. Other projects he has worked on include Griffin, a system to improve the lifetimes of SSDs, and a database provisioning system. He received a bachelor of applied science degree in 2003 and a master of applied science degree in 2005 from the University of Toronto.

Tuesday, 6 November 2012, 1:00 pm, DC 1304 DC 2585 (Not the usual day of the week, time, or room)

Title: Social Media Data Analytics Research
Speaker: Shivakumar Vaithyanathan, IBM Almaden Research Center
Abstract: Social media is an interactive vehicle for communication used daily by hundreds of millions of people. Businesses can derive significant benefits from listening to what users are publicly saying on social media, by transforming the massive amounts of textual content into insights specific to enterprise applications. Defining, extracting and representing entities such as people, organizations and products, and their inter-relationships enables the building of comprehensive consumer profiles that can be leveraged in enterprise applications such as customer retention and acquisition, campaign management and lead generation. Building these social media profiles requires a combination of text analytics and entity resolution, while the utilization of such profiles in applications requires statistical models and machine learning. In this talk, I will describe the work in progress at IBM Research on how such consumer insights, both at the level of an individual and at the level of appropriate micro-segments, can be used in applications in companies ranging from movie studios to financial services and insurance companies. I will also provide a brief overview of text, entity and statistical analysis tools that can operate in a distributed environment over very large amounts of data.
Bio: Shivakumar Vaithyanathan is the IBM Chief Scientist for Text Analytics and the Department Manager of the Intelligent Information Systems Group at the IBM Almaden Research Center. Since joining IBM in 1998, he has been involved in multiple research areas including development of learning algorithms, especially for extremely high-dimensional sparse data. His department is currently involved in building systems for Scalable Unstructured Analytics, Enterprise Search and Large-scale machine learning and Statistical Modeling. Multiple technologies developed in his department currently ship with several IBM products including IBM's Big Data Products.

Prior to IBM, Shivakumar was part of the newly formed Altavista Group at Digital. Shivakumar was an invited keynote speaker at the 2011 German Database Conference and the 2011 ACM SIGIR Industrial Track. He is also an Associate Editor of the Journal of Statistical Analysis and Data Mining.

28 November 2012, 2:30 pm, DC 1302

Title: There and Back Again: Outlier Detection between Statistical Reasoning and Efficient Database Methods (PDF)
Speaker: Arthur Zimek, University of Alberta
Abstract: While classical statistical methods for outlier detection had a focus on probabilistic reasoning, research on outlier detection in the database context during the last decade focused on the development of ever more efficient methods to compute outlier scores without much reasoning about the meaning of these scores. In this talk, we sketch this development and introduce some methods that go back again to statistical reasoning on top of the efficient database techniques. As we demonstrate, this opens up new possibilities for the design of ensemble methods for outlier detection.
Bio:

Dr. Arthur Zimek is a postdoctoral fellow at the University of Alberta, Edmonton, AB, working with Prof. Dr. Joerg Sander. Formerly, he worked as scientific assistant in the database systems and data mining group of Hans-Peter Kriegel at Ludwig-Maximilians-University Munich, Germany. He finished his PhD thesis in computing science on "Correlation Clustering" in summer 2008.

He received the "SIGKDD Doctoral Dissertation Award (runner-up)" in 2009. He received the "Best Paper Honorable Mention Award" at SDM 2008 and the "Best Demonstration Paper Award" at SSTD 2011 together with his co-authors. His research interests include unsupervised data mining (clustering and outlier detection), especially for high dimensional data, and evaluation of unsupervised data mining methods. He serves as a reviewer for several top database and data mining journals (e.g. VLDB Journal, IEEE TKDE, ACM TKDD, Data Min. Knowl. Disc., Stat. Anal. Data Min.) and as a member of program committees in data mining conferences (e.g. ACM SIGKDD 2011, 2012, ECML PKDD 2012).

27 February 2013, 2:30 pm, DC 1302

Title: Database Support for Recommender Systems (PDF)
Speaker: Mohamed Mokbel, University of Minnesota
Abstract: Recommender systems have become popular in both academia and industry, where the main purpose is to suggest to users useful and interesting items or contents from a considerably large set of items. Recommender systems are implicitly employed on a daily basis to recommend movies (e.g., Netflix), friends (e.g., Facebook), news articles (e.g., Google News), and books/products (e.g., Amazon). Over the last decade, the recommender system community focused mainly on quality of answers, while efficiency was treated as a secondary issue. In this talk, we show how recommender systems can benefit from database technologies to boost their performance and scalability. In particular, we will highlight four related projects at the University of Minnesota, all geared towards database support for recommender systems: (1) RecStore, which provides storage layer support for recommender systems, (2) RecBench, which provides a benchmark for various recommender system architectures, (3) LARS, which provides a location-aware recommender system, and (4) Recathon, which provides database support for context-aware recommender systems.
Bio: Mohamed Mokbel (Ph.D., Purdue University, MS and BS, Alexandria University) is an associate professor in the Department of Computer Science and Engineering, University of Minnesota. His current research interests focus on providing database and platform support for spatio-temporal data, location-based services 2.0, personalization, and recommender systems. Mohamed is the main architect for the PLACE, Casper, and CareDB systems that provide database support for location-based services, location privacy, and personalization, respectively. His research work has been recognized by three best paper awards at IEEE MASS 2008, IEEE MDM 2009, and SSTD 2011, and by the NSF CAREER award 2010. Mohamed is/was general co-chair of SSTD 2011, program co-chair of ACM SIGSPATIAL GIS 2008-2010, and MDM 2014, 2011. He has served on the editorial board of IEEE Data Engineering Bulletin, Distributed and Parallel Databases Journal, and Journal of Spatial Information Science. Mohamed is an ACM and IEEE member and a founding member of ACM SIGSPATIAL. For more information, please visit: www.cs.umn.edu/~mokbel

27 March 2013, 2:30 pm, DC 1302

Title: Indexing Massive Data Sets
Speaker: Michael A. Bender, Stony Brook University and Tokutek, Inc.
Abstract:

This talk describes write-optimization techniques used in the TokuDB database, developed at Tokutek. TokuDB uses Fractal-Tree indexes, which support asymptotically optimal point-queries and inserts that run one to two orders of magnitude faster than traditional indexes.

I first explain how to build write-optimized data structures, addressing theoretical and engineering issues. Write-optimized storage systems can ingest and query data orders of magnitude faster than traditional file systems and databases. For example, a prototype file system based on this technology can support over 20,000 file creates per second on a single disk.

Bio: Michael A. Bender is an associate professor of computer science at Stony Brook University and Chief Scientist at Tokutek, Inc. His research interests include analysis of algorithms, scheduling, parallel computing, data structures, and I/O-efficient computing on large data sets.

Bender has coauthored over 90 articles. He won an R&D 100 Award for scheduling in parallel computers. He has also won three awards for both graduate and undergraduate teaching.

Bender co-founded Tokutek in 2006. He has held Visiting Scientist positions at both MIT and King's College London.

Bender received his B.A. in Applied Mathematics from Harvard University in 1992 and obtained a D.E.A. in Computer Science from the Ecole Normale Superieure de Lyon, France in 1993. He completed a Ph.D. on Scheduling Algorithms from Harvard University in 1998.

Tuesday, 16 April 2013, 2:30 pm, DC 1302 (Not the usual day of the week)

Title: The SAP HANA Platform: Vision and R&D Agenda
Speaker: Franz Faerber, SAP
Abstract: The data processing needs of modern enterprises require solutions that go far beyond the traditional DBMS technologies. During the last several years these needs have exploded along several dimensions including volume, velocity and variety. The SAP HANA platform is based on breakthrough in-memory computing research and has the aim of enabling enterprises to make smarter, faster real-time decisions by providing insights based on all of the enterprise data. Franz will present the technical foundations of the SAP HANA platform and discuss a very ambitious vision based on R&D efforts as well as collaborative research with academia. He will review some past and current research initiatives as well as discuss the need for ongoing collaboration.
Bio:

Franz Faerber is Senior Vice President of SAP HANA – Technology and Innovation Platform. In this role he is responsible for SAP's database development. His team develops the SAP HANA platform (database and development platform components). The SAP HANA platform integrates database and development platform components based on innovative technology, optimized for SAP solutions while fully leveraging all talents and IP. This database platform will provide the fastest database for SAP applications with the lowest TCO while serving as the best platform for developing new business applications - both SAP and non-SAP applications.

Faerber reports to Vishal Sikka, member of the SAP executive board leading Technology and Innovation Platform. Faerber joined SAP in 1994 as a developer and acquired a wide and deep understanding of SAP's business and technology with a clear focus on data management. Prior to joining SAP, Faerber was working at IBM in Boeblingen as a developer. Faerber is an engineer (BA) in computer science.

29 May 2013, 2:30 pm, DC 1302

Title: Alternate Ways to Search Twitter
Speaker: Nick Koudas, University of Toronto
Abstract: I will be discussing some recent work we have been doing on identifying expertise and interest of users on Twitter. I will present an approach to search and identify users with expertise or interest in particular topics. I will then present a set of problems arising when one considers these approaches in the context of advertising.
Bio:

Nick Koudas is a professor of computer science at the University of Toronto. He conducts research in social media analytics and big data analytics.