Candidate: Benyamin Ghojogh
Title: Data Reduction Algorithms in Machine Learning and Data Science
Date: February 2, 2021
Time: 1:00 PM
Place: REMOTE ATTENDANCE
Supervisor(s): Crowley, Mark - Karray, Fakhri
Raw data usually need to be pre-processed for better representation or discrimination of classes. This pre-processing can be done by data reduction, i.e., reduction in either dimensionality or numerosity (cardinality). Dimensionality reduction can be used for feature extraction or data visualization. Numerosity reduction is useful for ranking the data points or finding the most and least important data points. This thesis proposes several algorithms for both kinds of data reduction, dimensionality and numerosity reduction, in machine learning and data science. Dimensionality reduction comprises feature extraction and feature selection methods, while numerosity reduction includes prototype selection and prototype generation approaches. This thesis focuses on feature extraction and prototype selection. Dimensionality reduction methods can be divided into three categories: spectral, probabilistic, and neural network-based methods. The spectral methods take a geometric point of view and mostly reduce to the generalized eigenvalue problem. Probabilistic and neural network-based methods have stochastic and information-theoretic foundations, respectively. Numerosity reduction methods can be divided into methods based on variance, geometry, and isolation.
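As a rough illustration of how a spectral method reduces to a generalized eigenvalue problem, the sketch below implements classical Fisher discriminant analysis, not any of the methods proposed in the thesis. It solves S_B w = lambda S_W w, where S_B and S_W are the between-class and within-class scatter matrices. The function name `fisher_directions`, the regularization constant `reg`, and the synthetic two-class data are assumptions made only for this example.

```python
import numpy as np

def fisher_directions(X, y, n_components=1, reg=1e-6):
    """Solve S_B w = lambda * S_W w and return the top eigenpairs."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    d = X.shape[1]
    mean_all = X.mean(axis=0)
    S_W = np.zeros((d, d))  # within-class scatter
    S_B = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        centered = Xc - mean_c
        S_W += centered.T @ centered
        diff = (mean_c - mean_all)[:, None]
        S_B += Xc.shape[0] * (diff @ diff.T)
    # Small ridge term (an assumption of this sketch) keeps S_W positive definite.
    S_W += reg * np.eye(d)
    # Whitening by the Cholesky factor S_W = L L^T turns the generalized
    # problem into an ordinary symmetric one: (L^-1 S_B L^-T) v = lambda v.
    L = np.linalg.cholesky(S_W)
    L_inv = np.linalg.inv(L)
    A = L_inv @ S_B @ L_inv.T
    eigvals, V = np.linalg.eigh(A)
    order = np.argsort(eigvals)[::-1][:n_components]
    W = L_inv.T @ V[:, order]  # map back: w = L^-T v
    return eigvals[order], W

# Synthetic data: two Gaussian classes separated along the first axis.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
    rng.normal([5.0, 0.0], 1.0, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)
eigvals, W = fisher_directions(X, y)
embedding = X @ W  # one-dimensional discriminative embedding
```

Because the two classes differ mainly along the first axis, the leading direction in `W` should weight that axis most heavily. Other spectral methods would substitute different matrices on the two sides of the same eigenvalue problem.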
For dimensionality reduction, under the spectral category, we propose weighted Fisher discriminant analysis, Roweis discriminant analysis, and image quality aware embedding. We also propose quantile-quantile embedding as a probabilistic method in which the distribution of the embedding is chosen by the user. Backprojection, Fisher losses, and dynamic triplet sampling using Bayesian updating are the proposed methods in the neural network-based category. Backprojection trains shallow networks with a projection-based perspective in manifold learning. Two Fisher losses are proposed for training Siamese triplet networks so as to increase the inter-class variance and decrease the intra-class variance. Two dynamic triplet mining methods, which are based on Bayesian updating to draw triplet samples stochastically, are also proposed. For numerosity reduction, principal sample analysis and instance ranking by matrix decomposition are the proposed variance-based methods; these methods rank instances using inter-/intra-class variances and matrix factorization, respectively. Curvature anomaly detection, in which the points are assumed to be the vertices of a polyhedron, and isolation Mondrian forest are the proposed methods based on geometry and isolation, respectively. Note that since the proposed tools are mostly used for data reduction as pre-processing, which comes before classification, regression, or clustering tasks, they can be applied offline; hence, their computational complexity is not a major concern.
To assess the proposed data reduction tools, we apply them to applications in medical image analysis, image processing, and computer vision. Data reduction, as a pre-processing tool, has a variety of applications because feature extraction and prototype selection are useful for different types of data. Dimensionality reduction extracts informative features, and prototype selection selects the most informative data instances. For example, in medical image analysis, we use Fisher losses and dynamic triplet sampling for embedding histopathology image patches. We also propose offline/online triplet mining using extreme distances for this embedding. In image processing and computer vision applications, we propose Roweisfaces and Roweisposes for face recognition and 3D action recognition, respectively, using the proposed Roweis discriminant analysis. We also introduce the concepts of anomaly landscape and anomaly path using the proposed curvature anomaly detection and use them to denoise images. We report extensive experiments on different datasets to show the effectiveness of the proposed algorithms.