PhD Defence • Data Systems | Machine Learning • Structured Prediction on Dirty Datasets

Friday, December 10, 2021 11:00 am - 11:00 am EST (GMT -05:00)

Please note: This PhD defence will be given online.

Alireza Heidarikhazaei, PhD candidate
David R. Cheriton School of Computer Science

Supervisor: Professor Ihab Ilyas

Data cleaning is a critically important step in any data-related task. Over the past few decades, many algorithms and systems have been presented to clean data. However, most of these solutions are either purely statistical or machine learning models that do not consider the underlying structure of the data, or rule-based logical methods. In both cases, they fall short of effectively finding and cleaning errors in structured datasets.

Many errors can be neither detected nor repaired without taking into account the underlying structure and dependencies in the dataset. One way of modeling the structure of the data is with graphical models. Graphical models combine probability theory and graph theory in order to address one of the key objectives in designing and fitting probabilistic models, which is to capture dependencies among relevant random variables. Representing structure helps either to understand the side effects of errors or to reveal the correct interrelations between data points. Hence, a principled representation of structure in prediction and cleaning tasks on dirty data is essential to positively impact the quality of downstream analytical results. Existing structured prediction research considers limited structures and configurations, with little attention to what the performance limitations are and how well the problem can be solved in more general settings, where the structure is complex and rich.

In this dissertation, I present the following thesis: “By leveraging the underlying dependency and structure in machine learning models, we can effectively detect and clean errors via pragmatic structured predictions techniques.” To highlight the main contributions, we investigate prediction algorithms and systems on dirty data with more realistic structure and dependencies to help deploy this type of learning in more pragmatic settings. Specifically, we introduce a few-shot learning framework for error detection that uses structure-based features of the data, such as denial constraint violations and a Bayesian network as a co-occurrence feature. We also study the problem of recovering the latent ground truth labeling of a structured instance. Then, we consider the problem of mining integrity constraints from data, specifically using sampling methods to extract approximate denial constraints. Finally, we introduce an ML framework that uses solitary and structured data features to solve the problem of record fusion.


To join this PhD defence on Google Meet, please go to https://meet.google.com/hyc-tnnw-cpy.