DB Meeting - Document Size Distribution

Wednesday, February 12, 2014 2:30 pm - 2:30 pm EST (GMT -05:00)
Speaker: Andrew Kane
Abstract: I will present a practice talk for our LSDS-IR 2014 workshop paper on document size distribution in the context of search engines, then outline a few related ideas that interested grad students could explore. Workshop paper synopsis: Search engines split large datasets across multiple machines using document distribution. Documents are typically distributed randomly, but we propose distributing them by size instead. This produces immediate improvements in both index size and query throughput. We show improvements to an in-memory conjunctive list intersection system using Simple16 compression and either skips or bitvectors. We also expect significant performance improvements in ranking-based search systems.
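
To make the contrast between the two distribution schemes concrete, here is a minimal Python sketch. It is not the paper's implementation: the function names, the use of a token count as "size", and the choice to give each machine an equal number of documents are all assumptions made for illustration. A real system might balance partitions on a different criterion.

```python
import random


def partition_random(doc_sizes, num_shards, seed=0):
    """Assign each document to a uniformly random shard (the typical baseline).

    doc_sizes[i] is the size of document i; sizes are ignored here.
    """
    rng = random.Random(seed)
    shards = [[] for _ in range(num_shards)]
    for doc_id in range(len(doc_sizes)):
        shards[rng.randrange(num_shards)].append(doc_id)
    return shards


def partition_by_size(doc_sizes, num_shards):
    """Assign documents to shards by size rank (illustrative assumption).

    Documents are sorted by size and cut into contiguous ranges with
    roughly equal document counts, so each shard holds documents of
    similar size.
    """
    # Ties on size are broken by document id so the result is deterministic.
    order = sorted(range(len(doc_sizes)), key=lambda d: (doc_sizes[d], d))
    shards = [[] for _ in range(num_shards)]
    for rank, doc_id in enumerate(order):
        shards[rank * num_shards // len(order)].append(doc_id)
    return shards


if __name__ == "__main__":
    sizes = [5, 1, 9, 2, 7, 3]  # hypothetical token counts per document
    print("random: ", partition_random(sizes, 2))
    print("by size:", partition_by_size(sizes, 2))
```

With the hypothetical sizes above, the size-based scheme places the three smallest documents on one shard and the three largest on the other, whereas the random scheme mixes sizes within each shard.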