Name: Omer Elghalhoud
Date: Dec 5, 2022
Time: 1:00pm
Location: EIT 3141
Supervisors: Sagar, Naik
Title: Data Balancing and Hyper-parameter Optimization for Machine Learning Algorithms for Secure IoT Networks
Abstract:
Nowadays, many industries rely on machine learning (ML) algorithms and their ability to learn from existing data to make inferences about new, unlabeled data. Applying ML algorithms to the network security domain is not new. However, without proper data pre-processing and proper optimization of their hyper-parameters (HPs), these algorithms might not achieve their full potential. Furthermore, attacks on network infrastructures come in a variety of forms and at different frequencies. Cyber-security experts often require the help of an automated process that filters and classifies attacks. Classifying the attack type is key to applying specific preventive measures for securing networks. Many ML models have been proposed as a base for Network Intrusion Detection (NID) systems. However, their performance varies based on multiple factors. For instance, an ML model fitted on a highly imbalanced dataset may be biased toward over-represented attack types. On the other hand, paying attention only to the ML model’s performance on the minority classes can negatively affect its performance on the majority classes or its overall performance. This research proposes a framework that applies pre-processing steps, including data balancing, and utilizes optimization techniques to tune the HPs of random forests, gradient boosting machines, and deep neural networks. The experiments conducted in this research provide a performance comparison between three different optimization algorithms: Tree-structured Parzen Estimator (TPE), Bayesian Optimization and Hyperband (BOHB), and Particle Swarm Optimization (PSO). The research results show that, through data balancing and optimization of the HPs and architecture of deep neural networks, their performance can improve significantly: a false alarm rate of 0% and only 1.79% on the BoT-IoT and the ToN-IoT benchmark datasets, respectively.
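To give a rough idea of what HP tuning with one of the compared optimizers involves, the following is a minimal sketch of Particle Swarm Optimization in plain Python. It is not the framework evaluated in this research: the objective `toy_validation_loss`, the two hyper-parameters (log10 learning rate and dropout), their bounds, and the swarm settings `w`, `c1`, `c2` are all assumptions chosen for illustration. In practice the objective would train a model and return its validation loss.

```python
import random

def pso_minimize(objective, bounds, n_particles=20, n_iters=50,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` over a box given by `bounds` with basic PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial positions inside the bounds, zero initial velocities.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    # Personal bests start at the initial positions.
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Move the particle and clip it back into the search box.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def toy_validation_loss(hp):
    """Hypothetical stand-in for 'train a model, return validation loss'."""
    log_lr, dropout = hp
    return (log_lr + 3.0) ** 2 + (dropout - 0.2) ** 2

# Assumed search space: log10(learning rate) in [-5, -1], dropout in [0, 0.5].
search_bounds = [(-5.0, -1.0), (0.0, 0.5)]
best_hp, best_loss = pso_minimize(toy_validation_loss, search_bounds)
print(best_hp, best_loss)
```

TPE and BOHB follow the same outer pattern (propose HPs, evaluate, update), but they choose the next candidates from a model of past evaluations rather than from particle velocities.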
To address the issue of imbalanced datasets, this research proposes a data balancing algorithm and compares its performance to existing approaches that use Random Over-Sampling (ROS), Synthetic Minority Over-sampling Technique (SMOTE), Adaptive Synthetic Sampling (ADASYN), and Generative Adversarial Networks (GANs). The data balancing algorithm is combined with Convolutional Neural Networks (CNNs) to extract spatial features and classify the different attack types. Using the NSL-KDD and the BoT-IoT datasets for benchmarking, the proposed system achieves high performance on the minority classes: recall scores of 70.50% and 72.08% on the User to Root (U2R) and Remote to Local (R2L) attack classes of the NSL-KDD dataset, respectively, while maintaining an overall False Alarm Rate (FAR) of 6.50% and a recall of 90.46% on the binary classification task. The proposed system scores a weighted average F1-Score of 99.45% on the multi-class classification task using the BoT-IoT dataset.
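As a concrete illustration of the simplest baseline named above, the sketch below implements Random Over-Sampling in plain Python, together with the two metrics quoted in the abstract. It is not the proposed balancing algorithm. The toy labels are invented, and the FAR definition used here (normal traffic flagged as an attack, divided by all normal traffic) is the common one in the intrusion detection literature and is assumed rather than taken from this abstract.

```python
import random
from collections import Counter

def random_over_sample(features, labels, seed=0):
    """Duplicate randomly chosen minority samples until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(features), list(labels)
    for cls, n in counts.items():
        idx = [i for i, y in enumerate(labels) if y == cls]
        for _ in range(target - n):
            j = rng.choice(idx)  # sample with replacement from this class
            out_x.append(features[j])
            out_y.append(cls)
    return out_x, out_y

def recall(y_true, y_pred, cls):
    """Fraction of samples of class `cls` that were predicted as `cls`."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    return tp / (tp + fn) if (tp + fn) else 0.0

def false_alarm_rate(y_true, y_pred, normal="normal"):
    """Assumed definition: FP / (FP + TN), with 'attack' as the positive class."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == normal and p != normal)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == normal and p == normal)
    return fp / (fp + tn) if (fp + tn) else 0.0

# Toy imbalanced data (invented): 6 normal, 3 DoS, 1 U2R sample.
features = [[i] for i in range(10)]
labels = ["normal"] * 6 + ["dos"] * 3 + ["u2r"] * 1
bal_x, bal_y = random_over_sample(features, labels)
print(Counter(bal_y))

# Toy predictions (invented) to exercise the metrics.
y_true = ["normal", "normal", "normal", "normal", "dos", "dos", "u2r", "u2r"]
y_pred = ["normal", "normal", "normal", "dos", "dos", "dos", "u2r", "normal"]
print(recall(y_true, y_pred, "u2r"), false_alarm_rate(y_true, y_pred))
```

Over-sampling should be applied to the training split only, so that duplicated samples do not leak into the data used for evaluation. SMOTE and ADASYN differ from ROS in that they synthesize new minority points by interpolation instead of duplicating existing ones.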