Data quality is one of the most important problems in data management, because dirty data often leads to inaccurate analytics results and wrong business decisions. Many data cleaning activities aim to improve data quality, such as filling in missing values, removing duplicate records, and fixing integrity constraint violations. Data cleaning usually involves three steps: specifying data quality rules, detecting errors, and repairing errors.
In this talk, I will present three projects, each tackling the challenges in one of the three steps of data cleaning. The first project, denial constraints discovery, proposes using denial constraints (DCs) as the formal language for expressing a variety of data quality rules. The second project, holistic data cleaning, advocates accumulating evidence when detecting and repairing errors, which leads to better cleaning accuracy. The third project, distributed data deduplication, tackles the scalability challenges in data cleaning.
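To make the idea of a denial constraint concrete, here is a minimal sketch of checking one such rule against a small table. It is not taken from the projects described above: the table, the column names, and the helper find_violations are all hypothetical, and the brute-force pairwise scan is only meant to illustrate the semantics. The rule shown says that no two tuples may share a zip code while disagreeing on state, i.e. not (t1.zip = t2.zip and t1.state != t2.state).

```python
import operator
from itertools import permutations

# Hypothetical toy table; column names and values are illustrative only.
rows = [
    {"name": "Ann", "zip": "10001", "state": "NY"},
    {"name": "Bob", "zip": "10001", "state": "NJ"},
    {"name": "Cat", "zip": "60601", "state": "IL"},
]

# A denial constraint is modeled here as a list of predicates that must not
# all hold at once for any ordered pair of distinct tuples (t1, t2).
# Each predicate is (attribute of t1, comparison operator, attribute of t2).
# This one encodes: not (t1.zip == t2.zip and t1.state != t2.state)
dc = [("zip", operator.eq, "zip"), ("state", operator.ne, "state")]


def find_violations(rows, dc):
    """Return index pairs (i, j) whose tuples satisfy every predicate of dc."""
    violations = []
    # Brute-force scan over all ordered pairs of distinct rows.
    for (i, t1), (j, t2) in permutations(enumerate(rows), 2):
        if all(op(t1[a], t2[b]) for a, op, b in dc):
            violations.append((i, j))
    return violations


violations = find_violations(rows, dc)
print(violations)
```

Because this constraint is symmetric, each conflicting pair is reported in both orders. The scan is quadratic in the number of rows, so it only suits small illustrative inputs.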