PhD Seminar • Data Systems • On Sampling From Data With Duplications

Wednesday, September 22, 2021 12:00 pm - 12:00 pm EDT (GMT -04:00)

Please note: This PhD seminar will be given online.

Alireza Heidarikhazaei, PhD candidate
David R. Cheriton School of Computer Science

Supervisor: Professor Ihab Ilyas

Data deduplication is the task of detecting records in a database that correspond to the same real-world entity. We propose procedures to sample uniformly from the set of entities present in the database. In general, this requires a two-stage process: the first stage estimates the frequency of every entity in the database, and the second uses rejection sampling to obtain an (approximately) uniform sample from the set of entities. However, efficiently estimating the frequencies of all entities is not a trivial task and is not attainable in the general case. Hence, in this work, we consider various natural properties of the data under which this frequency estimation (and consequently uniform sampling) is possible. Under each of these assumptions, we provide sampling algorithms and give rigorous proofs of the statistical and computational complexity of our approach. We complement our study with extensive experiments on both real and synthetic datasets.
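
As an illustration only, the two-stage idea can be sketched in Python under a strong simplifying assumption: each record's entity is already known through a hypothetical `entity_of` function, so frequencies are counted exactly rather than estimated. This is not the speaker's algorithm; the function name, its parameters, and the `max_tries` cutoff are assumptions made for this sketch.

```python
import random
from collections import Counter

def sample_entity_uniformly(records, entity_of, rng=None, max_tries=100_000):
    """Return one entity drawn uniformly from the distinct entities in records.

    Assumption: entity_of(record) is a hypothetical oracle that maps a record
    to its entity. In practice this mapping is unknown and the frequencies
    would have to be estimated.
    """
    if not records:
        raise ValueError("records must be non-empty")
    rng = rng or random.Random()

    # Stage 1: frequency of every entity (exact counts in this sketch).
    freq = Counter(entity_of(r) for r in records)
    f_min = min(freq.values())

    # Stage 2: rejection sampling. A uniformly drawn record hits entity e
    # with probability freq[e] / len(records); accepting with probability
    # f_min / freq[e] cancels that bias, so every entity is equally likely.
    for _ in range(max_tries):
        entity = entity_of(rng.choice(records))
        if rng.random() < f_min / freq[entity]:
            return entity
    raise RuntimeError("no sample accepted within max_tries")
```

For example, with 90 duplicate records of one entity and 10 of another, repeated calls should return each entity about half of the time, even though a uniformly drawn record belongs to the first entity 90% of the time.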

To join this PhD seminar on Zoom, please go to