Scalable Data Cleaning

Wednesday, November 30, 2016 12:30 pm - 12:30 pm EST (GMT -05:00)

Speaker: Xu Chu

Data quality is one of the most important problems in data management and data science, since dirty data often leads to inaccurate analytics results and wrong business decisions. It is estimated that data scientists spend 60-80% of their time cleaning and organizing data rather than performing modelling or data mining. A typical data cleaning process consists of three steps: data quality rule specification, error detection, and error repairing. In this talk, I will discuss my proposals for addressing the challenges in each of these steps. First, I will introduce a system that automatically discovers data quality rules from a possibly dirty sample data instance. Automatic discovery is particularly useful because asking users to design rules manually is expensive, requires domain expertise, and is rarely done in practice. Second, I will show a holistic error detection and repairing process that accumulates evidence from a broad spectrum of data quality rules and suggests more accurate repairs. Third, I will present a distribution strategy that scales up the combinatorial operations common in data cleaning, such as comparing every tuple pair to detect duplicates. I will conclude the talk by discussing ongoing work on cleaning relational data as well as other data forms (e.g., IoT data and unstructured data), and my long-term vision of debugging data analytics.
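
To make the three steps more concrete, the sketch below is a minimal, hypothetical illustration (not the speaker's actual systems): it checks a toy table against a functional-dependency-style rule (zip determines city) to flag violating tuple pairs, and it uses a simple blocking key so that duplicate detection compares only tuples within the same block instead of every possible pair. The table, rule, and blocking key are invented for the example.

```python
from itertools import combinations
from collections import defaultdict

# Toy relation: each tuple is a dict of attribute -> value.
rows = [
    {"id": 1, "name": "Alice Smith", "zip": "10001", "city": "New York"},
    {"id": 2, "name": "A. Smith",    "zip": "10001", "city": "NYC"},      # violates the rule; likely duplicate
    {"id": 3, "name": "Bob Jones",   "zip": "60601", "city": "Chicago"},
]

def fd_violations(rows, lhs, rhs):
    """Error detection for a functional-dependency-style rule lhs -> rhs:
    report pairs of tuples that agree on lhs but disagree on rhs."""
    violations = []
    for a, b in combinations(rows, 2):
        if a[lhs] == b[lhs] and a[rhs] != b[rhs]:
            violations.append((a["id"], b["id"]))
    return violations

def blocked_duplicate_pairs(rows, block_key):
    """Avoid comparing every tuple pair: group tuples by a blocking key
    and only generate candidate pairs within the same block."""
    blocks = defaultdict(list)
    for r in rows:
        blocks[r[block_key]].append(r)
    candidates = []
    for group in blocks.values():
        candidates.extend(combinations(group, 2))
    return [(a["id"], b["id"]) for a, b in candidates]

print(fd_violations(rows, "zip", "city"))     # [(1, 2)]
print(blocked_duplicate_pairs(rows, "zip"))   # [(1, 2)]
```

In practice, the per-block comparisons are what a distributed strategy would spread across machines, since blocking shrinks the quadratic all-pairs comparison down to much smaller groups.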