PhD Seminar: Semi-supervised Clustering for De-duplication

Wednesday, March 27, 2019 4:30 pm - 4:30 pm EDT (GMT -04:00)

Shrinu Kushagra, PhD candidate
David R. Cheriton School of Computer Science

Data de-duplication is the task of detecting multiple records that correspond to the same real-world entity in a database. In this work, we view de-duplication as a clustering problem. We introduce a framework which we call promise correlation clustering. Given a complete graph G with the edges labeled 0 and 1, the goal is to find a clustering that minimizes the number of 0 edges within a cluster plus the number of 1 edges across different clusters (or correlation loss). The optimal clustering can also be viewed as a complete graph \(G^* \) with edges corresponding to points in the same cluster being labeled 0 and other edges being labeled 1.

Under the promise that the edge difference between G and \(G^* \) is "small," we prove that finding the optimal clustering (or \(G^*\)) is still NP-Hard. Furthermore, the problem is NP-Hard (assuming the ETH-hypothesis) even when given access to an oracle (when the learner is allowed to interact with a domain expert, asking whether two given instances belong to the same cluster or not).

Given these negative results, we consider a restricted version of correlation clustering. As before, the goal is to find a clustering that minimizes the correlation loss. However, we restrict ourselves to a given class \(\mc F\) of clusterings. We offer a semi-supervised algorithmic approach to solve the restricted variant with success guarantees.