PhD Seminar: Semi-supervised Clustering for De-duplication

Shrinu Kushagra, PhD candidate
David R. Cheriton School of Computer Science

Data de-duplication is the task of detecting multiple records that correspond to the same real-world entity in a database. In this work, we view de-duplication as a clustering problem. We introduce a framework which we call promise correlation clustering. Given a complete graph G with the edges labeled 0 and 1, the goal is to find a clustering that minimizes the number of 0 edges within a cluster plus the number of 1 edges across different clusters (or correlation loss). The optimal clustering can also be viewed as a complete graph $G^*$ with edges corresponding to points in the same cluster being labeled 0 and other edges being labeled 1.

Under the promise that the edge difference between G and $G^*$ is "small," we prove that finding the optimal clustering (or $G^*$ ) is still NP-Hard. Furthermore, the problem is NP-Hard (assuming the ETH-hypothesis) even when given access to an oracle (when the learner is allowed to interact with a domain expert, asking whether two given instances belong to the same cluster or not).

Given these negative results, we consider a restricted version of correlation clustering. As before, the goal is to find a clustering that minimizes the correlation loss. However, we restrict ourselves to a given class $\mc F$ of clusterings. We offer a semi-supervised algorithmic approach to solve the restricted variant with success guarantees.

Location Information

Location Address: DC - William G. Davis Computer Research Centre
200 University Avenue West
2310
Waterloo, ON, CA N2L 3G1

Location coordinates:

Waterloo.ai