Please note: This seminar will take place in DC 2568 and online.
Hoai-An Nguyen, PhD candidate
Computer Science Department, Carnegie Mellon University
In the maximum coverage problem we aim to choose at most k subsets such that the number of distinct items covered by the union of the subsets is maximized. The input can be formalized as a n x d matrix A where there are n items in the universe and d subsets. A_ij is nonzero if item i is in subset j and is 0 otherwise. In this paper, to our knowledge, we create the first linear sketch to solve maximum coverage, potentially leading to large runtime improvements and has direct applications to distributed and streaming settings. We specifically focus on the direct application to turnstile streams which allow deletions. Here, the updates are of the form (i, j, ±1) which performs A_ij = A_ij ± 1. Previous work mainly considers the more restrictive set-arrival model where each update reveals an entire column of A or the insertion-only model which does not allow deletions. We design an algorithm with an O(d/ε^3) space bound, which is nearly optimal for constant k.
We then turn to fingerprinting for risk measurement. The input is again a n × d matrix A where there are n users and d features, and the aim is to determine which k columns together pose the greatest re-identification risk. Our maximum coverage sketch directly enables a solution of targeted fingerprinting for risk measurement. Furthermore, we give a result of independent interest: a sketch related to the complement of Fk for k ⩾ 2. We use this sketch to solve general fingerprinting for risk management. Empirical evaluation confirms the practicality of our fingerprinting algorithms and shows a speedup of up to 210x over prior work. We also demonstrate the use of our general fingerprinting algorithm as a dimensionality reduction technique facilitating enhanced feature selection efficiency.
Based on joint work with Alina Ene, Alessandro Epasto, Vahab Mirrokni, Huy Nguyen, David P Woodruff, and Peilin Zhong
To attend this seminar in person, please go to DC 2568. You can also attend virtually using Zoom.