Principal Sample Analysis for Data Reduction


Ghojogh, B. & Crowley, M., 2018. Principal Sample Analysis for Data Reduction. In International Conference on Big Knowledge (ICBK). Singapore: IEEE.


Data reduction is an essential technique for purifying data, training discriminative models more efficiently, encouraging generalizability, and reducing storage requirements on memory-limited systems.
The literature on data reduction focuses mostly on dimensionality reduction. However, data sample reduction (i.e., removal of data points from a dataset) has its own benefits and is no less important, given the growing size of datasets and the growing need for usable data analysis methods on the network edge.
This paper proposes a new data sample reduction method, Principal Sample Analysis (PSA), which reduces the number (population) of data samples as a preprocessing step for classification. PSA ranks the samples of each class by how well they represent that class, and it enables better discriminative learning by using the sparsity and similarity of samples at the same time. Data sample reduction then occurs by cutting off the lowest-ranked samples. The PSA method can work alongside any other data reduction/expansion method and any classification method. Experiments are carried out on three datasets (WDBC, AT&T, and MNIST) with contrasting characteristics and show the state-of-the-art effectiveness of the proposed method.
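
The rank-then-cut step described in the abstract can be illustrated with a minimal Python sketch. It is not the PSA algorithm: the abstract does not specify how samples are scored, so the score used here (negative Euclidean distance to the class mean) is a stand-in assumption, and the function name `reduce_samples_by_rank` and the parameter `keep_fraction` are hypothetical. The sketch only shows the overall shape of per-class ranking followed by removal of the lowest-ranked samples.

```python
import numpy as np


def reduce_samples_by_rank(X, y, keep_fraction=0.5):
    """Keep the top-ranked fraction of samples within each class.

    ASSUMPTION: the score is the negative distance to the class mean.
    This is a placeholder, not the ranking criterion used by PSA.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    kept = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)           # samples of this class
        centroid = X[idx].mean(axis=0)
        # Higher score means "more representative" under the stand-in rule.
        scores = -np.linalg.norm(X[idx] - centroid, axis=1)
        n_keep = max(1, int(np.ceil(keep_fraction * idx.size)))
        order = np.argsort(-scores, kind="stable")  # best-ranked first
        kept.extend(idx[order[:n_keep]])            # cut off the rest
    kept = np.sort(np.array(kept, dtype=int))
    return X[kept], y[kept], kept


if __name__ == "__main__":
    X = [[0, 0], [0.1, 0], [5, 5], [10, 10], [10.1, 10], [20, 20]]
    y = [0, 0, 0, 1, 1, 1]
    X_red, y_red, kept = reduce_samples_by_rank(X, y, keep_fraction=0.5)
    print(kept)
```

The reduced arrays can then be passed to any classifier in place of the full training set, which matches the abstract's description of sample reduction as a preprocessing step.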


Last updated on 09/21/2018