DB Meeting - Partial Materialization for On-Line Analytical Processing on Multi-Tagged Document Collections

Wednesday, October 1, 2014 2:30 pm - 2:30 pm EDT (GMT -04:00)
Speaker: Greg Drzadzewski
Abstract:

On-Line Analytical Processing (OLAP) systems are commonly used on top of structured data to help users make sense of large data collections by providing summary information that can be examined at various levels of detail. Partial materialization is used in these systems to reduce the time required to compute summaries while satisfying constraints on storage space and on the time available for updates.

When dealing with large collections of tagged documents, one would also benefit from the summarization operations provided by an OLAP system. Such a system could reduce the time users need to explore and understand the information contained in large document collections. Tagged document collections, however, require different types of measures for summarizing the data, and the data exhibits considerably different properties from the data in traditional OLAP. To address these issues, an OLAP system for documents requires a different design and a different partial materialization approach.

In this talk, I will describe a new document-centric partial materialization strategy that offers a faster average response time for the expected query workload than current partial materialization approaches, along with a lower storage space requirement. The performance of this new strategy is evaluated over real and synthetic document collections.