The Database Seminar Series provides a forum for presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks scheduled for 2012-2013 are listed below.
The talks are usually held on a Wednesday at 2:30 pm in room DC (Davis Centre) 1302. Coffee will be served 30 minutes before the talk.
We will try to post the presentation notes whenever possible. Please click on the presentation title to access these notes.
The Database Seminar Series is supported by Sybase iAnywhere.
|Title:||Provisioning for What-If Analysis|
|Speaker:||Val Tannen, University of Pennsylvania|
Problems of what-if analysis (such as hypothetical deletions, insertions, and modifications) over complex analysis queries are increasingly commonplace, e.g., in forming a business strategy or looking for causal relationships in science. Here, data analysts are typically interested in only task-specific views of the data, and they expect to be able to interactively manipulate the data in a natural and seamless way - possibly on a phone or tablet, and possibly via a spreadsheet or similar interface without having to carry the full machinery of a DBMS.
The Caravan system enables what-if analysis, i.e., fast, lightweight, interactive exploration of alternative answers, within views computed over large-scale, possibly distributed data sources. Our novel approach is based on creating dedicated provisioned autonomous representations, or PARs. PARs are compiled out of the data, initial analysis queries and user-specified what-if scenarios and are based on c-tables and their recent generalization to aggregation. They allow rapid evaluation of what-if scenarios without accessing the original data or performing complex query operations. Importantly, the size of PARs is governed by the parameters of the what-if analysis and is proportional to the size of the initial query answer rather than the typically much larger source data. Consequently, many what-if analysis tasks performed through PAR evaluations can be done autonomously, on limited-resource devices. We describe our model and architecture, demonstrate preliminary performance results, and present several open implementation and optimization issues.
Joint work with Daniel Deutch (Ben-Gurion U), Zack Ives (UPenn), and Tova Milo (Tel Aviv U).
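The core idea in the abstract, answering what-if questions from a compact annotated view instead of the source data, can be illustrated with a minimal sketch. This is not the Caravan system or its PAR format: the names `AnnotatedRow`, `requires`, and `total_revenue`, the sample data, and the restriction to conjunctive conditions are all assumptions made for illustration. Each row of a precomputed view carries a c-table-style condition over Boolean what-if variables, and a scenario is evaluated by assigning truth values to those variables.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List

# Hypothetical illustration only; not the Caravan API or PAR encoding.
Scenario = Dict[str, bool]  # truth value chosen for each what-if variable


@dataclass(frozen=True)
class AnnotatedRow:
    region: str
    revenue: float
    # c-table-style condition, simplified to a conjunction:
    # the row exists only if every listed variable is true.
    requires: FrozenSet[str]


def total_revenue(rows: List[AnnotatedRow], scenario: Scenario) -> Dict[str, float]:
    """Aggregate revenue per region under a scenario.

    Variables missing from the scenario default to True (nothing deleted).
    Only the annotated view is read; the source data is never consulted.
    """
    totals: Dict[str, float] = {}
    for row in rows:
        if all(scenario.get(var, True) for var in row.requires):
            totals[row.region] = totals.get(row.region, 0.0) + row.revenue
    return totals


# A tiny precomputed view (made-up data).
view = [
    AnnotatedRow("East", 100.0, frozenset({"supplier_a"})),
    AnnotatedRow("East", 50.0, frozenset({"supplier_b"})),
    AnnotatedRow("West", 70.0, frozenset({"supplier_a", "store_7"})),
]

baseline = total_revenue(view, {})
without_supplier_a = total_revenue(view, {"supplier_a": False})
```

Here the hypothetical deletion of `supplier_a` is answered by re-evaluating conditions over three annotated rows, which mirrors the abstract's point that the cost depends on the size of the view rather than on the source data.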
|Title:||MixApart: Decoupled Analytics for Shared Storage Systems|
|Speaker:||Gokul Soundararajan, NetApp|
|Abstract:||Data analytics and enterprise applications have very different storage functionality requirements. For this reason, enterprise deployments of data analytics are on a separate storage silo. This generates additional costs and inefficiencies in data management, e.g., whenever data needs to be archived, copied, or migrated across silos. We design MixApart, a scalable data processing framework for shared enterprise storage systems. With MixApart, a single consolidated storage back-end manages enterprise data and services all types of workloads, thus simplifying data management and lowering hardware costs for enterprises. In addition, MixApart enables the local storage performance required by data analytics through an integrated data caching and scheduling solution. We expect that our decoupled, stateless cache design will be most useful for cross-data center deployments and for transparent and consistent refresh of analytics data upon updates to underlying enterprise data. We evaluate MixApart using a 100-core Amazon EC2 cluster with micro-benchmarks and production workload traces. Our evaluation shows that MixApart provides i) up to 28% faster performance than the traditional ingest-then-compute workflows used in enterprise IT analytics, and ii) comparable performance to an ideal Hadoop setup without data ingest, at similar cluster sizes.|
|Bio:||Gokul Soundararajan is a member of technical staff at NetApp. He currently works on techniques to improve the efficiency and management of dynamic data centers. Prior to joining NetApp, in 2010, Gokul received his PhD degree from the University of Toronto in Electrical and Computer Engineering. For his dissertation, Gokul developed novel techniques to improve performance and the manageability of multi-tenant shared data centers. Other projects he has worked on include Griffin, a system to improve the lifetimes of SSDs, and a database provisioning system. He received a bachelor of applied science degree in 2003 and a master of applied science degree in 2005 from the University of Toronto.|
|Title:||Social Media Data Analytics Research|
|Speaker:||Shivakumar Vaithyanathan, IBM Almaden Research Center|
|Abstract:||Social media is an interactive vehicle for communication used daily by hundreds of millions of people. Businesses can derive significant benefits from listening to what users are publicly saying on social media, by transforming the massive amounts of textual content into insights specific to enterprise applications. Defining, extracting and representing entities such as people, organizations and products, and their inter-relationships enables the building of comprehensive consumer profiles that can be leveraged in enterprise applications such as customer retention and acquisition, campaign management and lead generation. Building these social media profiles requires a combination of text analytics and entity resolution, while the utilization of such profiles in applications requires statistical models and machine learning. In this talk, I will describe the work in progress at IBM Research on how such consumer insights, both at the level of an individual and at the level of appropriate micro-segments, can be used in applications in companies ranging from movie studios to financial services and insurance companies. I will also provide a brief overview of text, entity and statistical analysis tools that can operate in a distributed environment over very large amounts of data.|
|Bio:||Shivakumar Vaithyanathan is the IBM Chief Scientist for Text Analytics and the Department Manager of the Intelligent Information Systems Group at the IBM Almaden Research Center. Since joining IBM in 1998, he has been involved in multiple research areas including development of learning algorithms, especially for extremely high-dimensional sparse data. His department is currently involved in building systems for Scalable Unstructured Analytics, Enterprise Search and Large-scale machine learning and Statistical Modeling. Multiple technologies developed in his department currently ship with several IBM products including IBM's Big Data Products.
Prior to IBM, Shivakumar was part of the newly formed AltaVista Group at Digital. Shivakumar was an invited keynote speaker at the 2011 German Database Conference and 2011 ACM SIGIR Industrial Track. He is also an Associate Editor of Journal of Statistical Analysis and Data Mining.
|Title:||There and Back Again: Outlier Detection between Statistical Reasoning and Efficient Database Methods (PDF)|
|Speaker:||Arthur Zimek, University of Alberta|
|Abstract:||While classical statistical methods for outlier detection had a focus on probabilistic reasoning, research on outlier detection in the database context during the last decade focused on the development of ever more efficient methods to compute outlier scores without much reasoning about the meaning of these scores. In this talk, we sketch this development and introduce some methods that go back again to statistical reasoning on top of the efficient database techniques. As we demonstrate, this opens up new possibilities for the design of ensemble methods for outlier detection.|
Dr. Arthur Zimek is a postdoctoral fellow at the University of Alberta, Edmonton, AB, working with Prof. Dr. Joerg Sander. Formerly, he worked as a scientific assistant in the database systems and data mining group of Hans-Peter Kriegel at Ludwig-Maximilians-University Munich, Germany. He finished his PhD thesis in computing science on "Correlation Clustering" in summer 2008.
He received the "SIGKDD Doctoral Dissertation Award (runner-up)" in 2009. He received the "Best Paper Honorable Mention Award" at SDM 2008 and the "Best Demonstration Paper Award" at SSTD 2011 together with his co-authors. His research interests include unsupervised data mining (clustering and outlier detection), especially for high dimensional data, and evaluation of unsupervised data mining methods. He serves as a reviewer for several top database and data mining journals (e.g. VLDB Journal, IEEE TKDE, ACM TKDD, Data Min. Knowl. Disc., Stat. Anal. Data Min.) and as a member of program committees in data mining conferences (e.g. ACM SIGKDD 2011, 2012, ECML PKDD 2012).
|Title:||Database Support for Recommender Systems (PDF)|
|Speaker:||Mohamed Mokbel, University of Minnesota|
|Abstract:||Recommender systems have become popular in both academia and industry, where the main purpose is to suggest to users useful and interesting items or contents from a considerably large set of items. Recommender systems are implicitly employed on a daily basis to recommend movies (e.g., Netflix), friends (e.g., Facebook), news articles (e.g., Google News), and books/products (e.g., Amazon). Over the last decade, the recommender system community focused mainly on quality of answers, while efficiency was treated as a secondary issue. In this talk, we show how recommender systems can benefit from database technologies to boost their performance and scalability. In particular, we will highlight four related projects at the University of Minnesota, all geared towards database support for recommender systems: (1) RecStore, which provides storage layer support for recommender systems, (2) RecBench, which provides a benchmark for various recommender system architectures, (3) LARS, which provides a location-aware recommender system, and (4) Recathon, which provides database support for context-aware recommender systems.|
|Bio:||Mohamed Mokbel (Ph.D., Purdue University, MS and BS, Alexandria University) is an associate professor in the Department of Computer Science and Engineering, University of Minnesota. His current research interests focus on providing database and platform support for spatio-temporal data, location-based services 2.0, personalization, and recommender systems. Mohamed is the main architect for the PLACE, Casper, and CareDB systems that provide database support for location-based services, location privacy, and personalization, respectively. His research work has been recognized by three best paper awards at IEEE MASS 2008, IEEE MDM 2009, and SSTD 2011, and by the NSF CAREER award 2010. Mohamed is/was general co-chair of SSTD 2011, program co-chair of ACM SIGSPATIAL GIS 2008-2010, and MDM 2014, 2011. He has served on the editorial boards of IEEE Data Engineering Bulletin, Distributed and Parallel Databases Journal, and Journal of Spatial Information Science. Mohamed is an ACM and IEEE member and a founding member of ACM SIGSPATIAL. For more information, please visit: www.cs.umn.edu/~mokbel|
|Title:||Indexing Massive Data Sets|
|Speaker:||Michael A. Bender, Stony Brook University and Tokutek, Inc|
This talk describes write-optimization techniques used in the TokuDB database, developed at Tokutek. TokuDB uses Fractal-Tree indexes, which support asymptotically optimal point-queries and inserts that run one to two orders of magnitude faster than traditional indexes.
I first explain how to build write-optimized data structures, addressing theoretical and engineering issues. Write-optimized storage systems can ingest and query data orders of magnitude faster than traditional file systems and databases. For example, a prototype file system based on this technology can support over 20,000 file creates per second on a single disk.
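As a rough illustration of what write optimization means, the sketch below buffers inserts in memory and merges them into a sorted run in batches, so that the expensive reorganization is paid once per batch rather than once per insert. This is a deliberately simplified toy and not a Fractal-Tree index or anything from TokuDB: the class `BufferedIndex`, its `buffer_capacity` parameter, and the in-memory lists standing in for on-disk storage are all assumptions made for illustration.

```python
import bisect


class BufferedIndex:
    """Toy write-optimized index (hypothetical; not TokuDB's design).

    Inserts land in a small buffer. When the buffer fills, its contents
    are merged into a sorted run in a single batch.
    """

    def __init__(self, buffer_capacity=4):
        self.buffer_capacity = buffer_capacity
        self.buffer = {}   # recent writes, not yet merged
        self.keys = []     # sorted keys of the merged run
        self.values = []   # values aligned with self.keys
        self.flushes = 0   # number of batch merges performed

    def insert(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.buffer_capacity:
            self._flush()

    def _flush(self):
        # Merge the buffered writes into the run; newer values win.
        merged = dict(zip(self.keys, self.values))
        merged.update(self.buffer)
        self.keys = sorted(merged)
        self.values = [merged[k] for k in self.keys]
        self.buffer.clear()
        self.flushes += 1

    def get(self, key, default=None):
        # Point query: check recent writes first, then binary-search the run.
        if key in self.buffer:
            return self.buffer[key]
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return default


index = BufferedIndex(buffer_capacity=4)
for k in range(10):
    index.insert(k, k * k)
```

After these ten inserts the run has been rebuilt only twice, and the last two keys are still served from the buffer. Real write-optimized structures apply the same batching idea at every level of a tree, which this sketch does not attempt.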
|Bio:||Michael A. Bender is an associate professor of computer science at Stony Brook University and Chief Scientist at Tokutek, Inc. His research interests include analysis of algorithms, scheduling, parallel computing, data structures, and I/O-efficient computing on large data sets.
Bender has coauthored over 90 articles. He won an R&D 100 Award for scheduling in parallel computers. He has also won three awards for both graduate and undergraduate teaching.
Bender co-founded Tokutek in 2006. He has held Visiting Scientist positions at both MIT and King's College London.
Bender received his B.A. in Applied Mathematics from Harvard University in 1992 and obtained a D.E.A. in Computer Science from the Ecole Normale Superieure de Lyon, France in 1993. He completed a Ph.D. on Scheduling Algorithms from Harvard University in 1998.
|Title:||Alternate Ways to Search Twitter|
|Speaker:||Nick Koudas, University of Toronto|
|Abstract:||I will be discussing some recent work we have been doing on identifying expertise and interest of users on Twitter. I will present an approach to search and identify users with expertise or interest in particular topics. I will then present a set of problems arising when one considers these approaches in the context of advertising.|
Nick Koudas is a professor of computer science at the University of Toronto. He conducts research in social media analytics and big data analytics.