MASc Seminar Notice: Impact of data quality on ML models: Improving data quality with Outlier Detection

Thursday, March 28, 2024 3:00 pm - 4:00 pm EDT (GMT -04:00)

Candidate: Rakshit Sharma
Date: March 28, 2024
Time: 3:00pm
Location: MS Teams
Supervisor: Sebastian Fischmeister
All are welcome!

Abstract:

In the dynamic landscape of Machine Learning (ML) applications, data quality is an important factor in the performance of ML models. In this thesis, we present a study that proposes new methods for enhancing data quality through an iterative data recapture approach. The research focuses primarily on univariate time-series data from which specific patterns can be extracted.

We start by discussing existing data capture methods, in which data is collected either manually or with hardware devices. We then detail the two proposed methods, the Sessionized Recapture Strategy (SRS) and the Robust Single Capture Method (RSCM), each of which offers a distinct strategy for iterative data recapture.

The Single Capture Method (SCM) and the Recapture and Visualize Method (RVM) serve as the two baseline methods, each characterized by its data capture time and resulting False Positive Rate (FPR). SRS is an enhancement of RVM, and RSCM is an enhancement of SCM. This thesis also introduces an outlier detection algorithm named Outlier detection through ParameterlEss Robust Algorithm (OPERA), which, when combined with RVM and SCM, yields SRS and RSCM respectively.

Compared with the baseline methods, the proposed methods show promising results and improve the quality of the captured data. The experiments are performed on two datasets: the first was captured in the Embedded Systems Lab on one of the ANVIL products for Future Technology Devices International (FTDI) chips, and the second is a publicly available Electrocardiogram (ECG) dataset provided by PhysioNet [20]. The research concludes by synthesizing key findings and offering recommendations for practitioners seeking to optimize model performance through enhanced data quality.