Candidate: S M Taslim Uddin Raju
Date: April 16, 2025
Time: 11:00am
Location: online - Teams
Supervisor: Fakhri Karray
All are welcome!
Abstract:
Microscopic assessment of histopathology images is vital for accurate cancer diagnosis and treatment. Classification and captioning of Whole Slide Images (WSIs) for pathological analysis is an essential but not extensively explored aspect of computer-aided pathological diagnosis. Challenges arise from insufficient datasets and from the difficulty of training models effectively. Automatically generating caption reports for diverse gastric adenocarcinoma images poses a further challenge. Moreover, microscopic WSIs present issues such as redundant patches and unknown patch positions, since the patches are captured subjectively by pathologists.
The thesis consists of two parts. First, a hybrid method referred to as TransUAAE-CapGen is introduced to generate histopathological captions from WSI patches. The TransUAAE-CapGen architecture consists of a hybrid UNet-based Adversarial Autoencoder (AAE) for feature extraction and a transformer for caption generation. The hybrid UNet-based AAE extracts complex tissue properties from histopathological patches and transforms them into low-dimensional embeddings, which are then fed into the transformer to generate concise captions.
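To make the two-stage pipeline concrete, the following is a minimal PyTorch sketch of the idea: a toy convolutional encoder stands in for the UNet-based AAE, and a small transformer decoder generates caption tokens from the resulting embedding. Module names (UNetAAEEncoder, CaptionDecoder), dimensions, and layer counts are illustrative assumptions, not the thesis implementation.

```python
# Minimal PyTorch sketch of a TransUAAE-CapGen-style pipeline (illustrative only;
# names, dimensions, and layers are assumptions, not the thesis implementation).
import torch
import torch.nn as nn


class UNetAAEEncoder(nn.Module):
    """Toy convolutional encoder standing in for the UNet-based adversarial autoencoder."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.to_latent = nn.Linear(64, latent_dim)

    def forward(self, patch):                      # patch: (B, 3, H, W)
        h = self.down(patch).flatten(1)            # pooled features: (B, 64)
        return self.to_latent(h)                   # low-dimensional embedding: (B, latent_dim)


class CaptionDecoder(nn.Module):
    """Transformer decoder that attends to the patch embedding to produce caption tokens."""
    def __init__(self, vocab_size=5000, d_model=256, nhead=8, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, patch_embedding):
        memory = patch_embedding.unsqueeze(1)      # treat the embedding as a length-1 memory sequence
        tgt = self.embed(tokens)                   # (B, T, d_model)
        T = tokens.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)                   # next-token logits: (B, T, vocab_size)


# Forward pass on dummy data.
encoder, decoder = UNetAAEEncoder(), CaptionDecoder()
patches = torch.randn(4, 3, 224, 224)              # a batch of WSI patches
tokens = torch.randint(0, 5000, (4, 12))           # partial caption token ids
logits = decoder(tokens, encoder(patches))         # logits used for caption generation
print(logits.shape)                                # torch.Size([4, 12, 5000])
```

In the actual architecture, the encoder would additionally be trained adversarially so that the latent embeddings follow a chosen prior before captions are decoded; that discriminator and the UNet skip connections are omitted here for brevity.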
The second part focuses on reducing redundant patches and handling the unknown patch positions that result from subjective pathologist captures. To address these challenges, a novel GNN-ViTCap framework is introduced for classification and caption generation from histopathological microscopic images. A visual feature extractor produces patch embeddings; redundant patches are removed by deep embedded clustering, which dynamically clusters the images, and representative images are then selected through a scalar dot-product attention mechanism. A graph is constructed with edges derived from the similarity matrix, ensuring that each node is connected to its closest neighbors, and a graph neural network is used to capture contextual information from both local and global regions. The aggregated image embeddings are projected into the language model's input space using a linear layer and combined with input caption tokens to fine-tune a large language model for caption generation.
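As a rough illustration, the sketch below follows the same sequence of steps in plain PyTorch. Clustering is approximated by a simple nearest-centroid assignment, the graph layer is a basic adjacency-normalized message-passing layer rather than any particular GNN variant, and helper names such as select_representatives, knn_graph, and projector are assumptions made for illustration, not the GNN-ViTCap code.

```python
# Illustrative PyTorch sketch of a GNN-ViTCap-style flow (not the thesis implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


def select_representatives(embeddings, num_clusters=4):
    """Stand-in for deep embedded clustering + dot-product attention: assign patches
    to the nearest of a few centroids, then keep one attention-weighted
    representative embedding per cluster."""
    centroids = embeddings[torch.randperm(embeddings.size(0))[:num_clusters]]
    assign = torch.cdist(embeddings, centroids).argmin(dim=1)      # cluster id per patch
    reps = []
    for c in range(num_clusters):
        members = embeddings[assign == c]
        if members.numel() == 0:
            continue
        attn = F.softmax(members @ members.mean(0), dim=0)         # dot-product attention scores
        reps.append((attn.unsqueeze(1) * members).sum(0))          # weighted representative
    return torch.stack(reps)                                       # (num_reps, d)


def knn_graph(x, k=3):
    """Build a row-normalized kNN adjacency matrix from pairwise cosine similarity."""
    k = min(k, x.size(0) - 1)
    sim = F.normalize(x, dim=1) @ F.normalize(x, dim=1).T
    idx = sim.topk(k + 1, dim=1).indices[:, 1:]                    # drop self-similarity
    adj = torch.zeros_like(sim)
    adj.scatter_(1, idx, 1.0)
    adj = ((adj + adj.T) > 0).float() + torch.eye(x.size(0))       # symmetrize, add self-loops
    return adj / adj.sum(dim=1, keepdim=True)                      # row-normalize


class SimpleGNNLayer(nn.Module):
    """Plain message passing: aggregate neighbor features, then apply a linear transform."""
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, x, adj):
        return F.relu(self.lin(adj @ x))


# Toy end-to-end pass on random "patch" embeddings.
d_vit, d_llm = 768, 2048
patch_embeddings = torch.randn(32, d_vit)         # e.g. ViT features for 32 microscope patches
reps = select_representatives(patch_embeddings)   # remove redundancy, keep representatives
adj = knn_graph(reps)                             # edges derived from the similarity matrix
context = SimpleGNNLayer(d_vit)(reps, adj)        # local/global contextual embeddings
projector = nn.Linear(d_vit, d_llm)               # map into the language model's input space
llm_inputs = projector(context)
print(llm_inputs.shape)                           # torch.Size([num_reps, 2048])
```

In practice, the projected embeddings would be prepended to the caption token embeddings so that the (frozen or partially fine-tuned) language model conditions its generated report on the slide-level context.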
Our proposed methods are validated on the BreakHis and PatchGastric microscopic datasets, and several ablation studies were performed to verify their design. Experimental analysis demonstrates that the proposed TransUAAE-CapGen and GNN-ViTCap architectures outperform state-of-the-art approaches. Both approaches can be integrated into clinical workflows to facilitate early diagnosis and treatment planning, ultimately enhancing patient care based on whole slide images.