Candidate: Md. Milon Islam
Date: November 21, 2024
Time: 10:00 AM
Location: Online - contact the candidate for more information.
Supervisor: Karray, Fakhri (Adjunct)
Abstract:
The significant increase in the number of individuals with chronic ailments has created an urgent need for the development of an innovative model for healthcare systems. The design of smart healthcare platforms, a subject of growing interest, has become technologically feasible due to the emergence of modern technologies, including smartphones, wearable sensors, 5G communication networks, the Internet of Things (IoT), cloud computing, and artificial intelligence, particularly machine learning and deep learning. Together, these technologies enable expanded levels of data storage, computation, and secure communication for devices and servers, thus drastically increasing the degree of mobility, security, and available functionality. This thesis focuses on the development of an AI-driven data fusion architecture within a smart healthcare platform for a scalable, intelligent, and easily deployable remote monitoring system, aimed at providing cost-effective, quality healthcare services in assisted living centers.
The contributions of this thesis are two-fold: we focus on two major aspects of human behavior, Human Activity Recognition (HAR) and emotion recognition, to monitor the elderly in smart healthcare. In the first part, we propose two AI-enabled multimodal fusion approaches for HAR in ambient assisted living. First, we present a deep learning-based fusion approach for multimodal HAR that fuses the different modalities of data to obtain robust outcomes. Convolutional Neural Networks (CNNs) retrieve high-level attributes from the image data, and a Convolutional Long Short-Term Memory (ConvLSTM) network captures significant patterns from the multi-sensory data. The extracted features from the two modalities are then fused through a self-attention mechanism that enhances the relevant activity information and suppresses superfluous, potentially confusing information by measuring their compatibility. Second, we propose a multi-level feature fusion technique for multimodal HAR that uses a multi-head CNN with a Convolutional Block Attention Module (CBAM) to process the visual data and a ConvLSTM to handle the time-sensitive multi-source sensor information. The architecture analyzes and retrieves channel- and spatial-dimension features from the visual information through three CNN branches combined with CBAM, while the ConvLSTM network captures temporal features from the multiple sensors' time-series data for efficient activity recognition.
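For illustration, the following is a minimal PyTorch-style sketch of the attention-based multimodal fusion idea: a small CNN branch for visual frames, a recurrent branch standing in for the ConvLSTM over multi-sensor time series, and self-attention that weighs the two feature streams before classification. All module names, layer sizes, and the plain-LSTM stand-in are illustrative assumptions, not the thesis implementation.

# Minimal sketch of attention-based multimodal HAR fusion (assumed architecture).
import torch
import torch.nn as nn


class MultimodalHARFusion(nn.Module):
    def __init__(self, num_classes: int, sensor_channels: int, d_model: int = 128):
        super().__init__()
        # Visual branch: a small CNN mapping an RGB frame to a d_model vector.
        self.visual = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, d_model),
        )
        # Sensor branch: an LSTM used here as a stand-in for ConvLSTM over
        # multi-sensor time series.
        self.sensor = nn.LSTM(sensor_channels, d_model, batch_first=True)
        # Self-attention over the two modality tokens measures their
        # compatibility and re-weights them before fusion.
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * d_model, num_classes)

    def forward(self, frames: torch.Tensor, sensors: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 3, H, W); sensors: (batch, time, sensor_channels)
        v = self.visual(frames)                       # (batch, d_model)
        _, (h, _) = self.sensor(sensors)
        s = h[-1]                                     # (batch, d_model)
        tokens = torch.stack([v, s], dim=1)           # (batch, 2, d_model)
        fused, _ = self.attn(tokens, tokens, tokens)  # attention-weighted modalities
        return self.classifier(fused.flatten(1))      # (batch, num_classes)


# Example: 10 activity classes, 9 sensor channels (e.g., accelerometer, gyroscope, magnetometer).
model = MultimodalHARFusion(num_classes=10, sensor_channels=9)
logits = model(torch.randn(4, 3, 64, 64), torch.randn(4, 50, 9))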
The second part of this thesis focuses on multimodal emotion recognition in connected healthcare to monitor the patient's health status. To achieve this goal, we introduce two feature fusion approaches based on AI tools for emotion recognition. First, we propose a novel model-level fusion technique based on deep learning for enhanced emotion recognition from multimodal signals to monitor patients in connected healthcare. The representative visual features from the video signals are extracted through a Depthwise Separable Convolutional Neural Network, and the optimized temporal attributes are derived from multiple physiological signals using a Bi-directional Long Short-Term Memory (Bi-LSTM) network. A soft attention method fuses the high-level features obtained from the two data modalities, retrieving the most significant features by focusing on their emotionally salient parts. We exploit two face detection methods, the Histogram of Oriented Gradients and a CNN-based face detector (ResNet-34), to observe the effect of facial features on emotion recognition. Second, we introduce a Multi-Stage Fusion Network (MSF-Net) for emotion recognition that extracts multimodal information and achieves strong performance. We utilize a transformer-based structure to extract deep features from facial expressions, and exploit two visual descriptors, Local Binary Patterns and Oriented FAST and Rotated BRIEF, to retrieve classical computer vision features from the facial videos. A feature-level fusion network integrates the features extracted by these modules and directs the output into a triplet attention module, which employs a three-branch architecture to compute attention weights that capture cross-dimensional interactions efficiently. The temporal dependencies in the physiological signals are modeled by a Bi-directional Gated Recurrent Unit (Bi-GRU) in forward and backward directions at each time step. Lastly, the output feature representations from the triplet attention module and the high-level patterns extracted by the Bi-GRU are fused and fed into the classification module to recognize emotions.
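The following is a minimal PyTorch-style sketch of the model-level soft-attention fusion idea for emotion recognition: a depthwise separable convolution block for facial frames, a bidirectional LSTM for physiological time series, and a learned soft-attention weighting of the two feature streams. Layer sizes, the particular attention formulation, and all names are illustrative assumptions rather than the thesis code.

# Minimal sketch of soft-attention, model-level multimodal emotion fusion (assumed architecture).
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    # Depthwise convolution followed by a 1x1 pointwise convolution.
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))


class EmotionFusionNet(nn.Module):
    def __init__(self, num_emotions: int, phys_channels: int, d_model: int = 128):
        super().__init__()
        # Visual branch built from depthwise separable convolutions.
        self.visual = nn.Sequential(
            DepthwiseSeparableConv(3, 32), nn.ReLU(), nn.MaxPool2d(2),
            DepthwiseSeparableConv(32, 64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, d_model),
        )
        # Bi-LSTM over physiological signals; hidden size halved because the
        # bidirectional output concatenates both directions.
        self.phys = nn.LSTM(phys_channels, d_model // 2, batch_first=True, bidirectional=True)
        # Soft attention: one scalar weight per modality, normalized with softmax.
        self.attn = nn.Linear(d_model, 1)
        self.classifier = nn.Linear(d_model, num_emotions)

    def forward(self, frames: torch.Tensor, signals: torch.Tensor) -> torch.Tensor:
        v = self.visual(frames)                       # (batch, d_model)
        out, _ = self.phys(signals)
        p = out[:, -1, :]                             # (batch, d_model)
        feats = torch.stack([v, p], dim=1)            # (batch, 2, d_model)
        weights = torch.softmax(self.attn(feats), dim=1)
        fused = (weights * feats).sum(dim=1)          # attention-weighted sum
        return self.classifier(fused)


# Example: 4 emotion classes, 8 physiological channels (hypothetical sizes).
model = EmotionFusionNet(num_emotions=4, phys_channels=8)
logits = model(torch.randn(2, 3, 64, 64), torch.randn(2, 100, 8))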
Finally, we conduct a series of extensive experiments to compare the performance of the proposed networks against State-of-the-Art (SOTA) approaches. The experimental results reveal that the developed multimodal fusion networks surpass existing SOTA methods across multiple performance metrics. We also deploy an IoT system to test the developed feature-fusion networks in a real-world, scalable smart healthcare application. The developed multimodal predictive analytics frameworks, residing in the cloud and trained on large datasets, continually analyze the ingested data and issue appropriate notifications to the patients themselves or to healthcare providers through a user-friendly human-machine interface.
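As a simplified illustration of the cloud-side monitoring loop, the sketch below ingests a window of multimodal data, runs a trained fusion model, and notifies the patient or care provider when an alert-worthy state is predicted. The labels, thresholds, and notify() transport are hypothetical placeholders, not the deployed system.

# Hypothetical sketch of the cloud-side ingest -> infer -> notify loop.
import torch

LABELS = ["walking", "sitting", "fall", "distress"]   # assumed class names
ALERT_LABELS = {"fall", "distress"}                   # assumed alert-worthy classes


def notify(recipient: str, message: str) -> None:
    # Stand-in for the human-machine interface (e.g., push notification or dashboard).
    print(f"[ALERT -> {recipient}] {message}")


@torch.no_grad()
def monitor_window(model, frames, sensors, recipient="care_provider"):
    # Run the trained fusion model on one window of ingested data.
    model.eval()
    probs = torch.softmax(model(frames, sensors), dim=-1)
    label = LABELS[int(probs.argmax(dim=-1)[0])]
    confidence = probs.max().item()
    if label in ALERT_LABELS:
        notify(recipient, f"Detected '{label}' with confidence {confidence:.2f}")
    return label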