MASc Seminar: Audio-Visual Feature Fusion through Transformers for Automated Depression Screening in Social Media Content

Thursday, April 16, 2026 11:00 am - 12:00 pm EDT (GMT -04:00)

Candidate: Md Rezwanul Haque
Location: Online
Supervisor: Prof. Fakhri Karray
Co-Supervisor: Prof. Pin-Han Ho

All are welcome!

Abstract:
Depression has become a critical public health concern, with the World Health Organization reporting that over 280 million people worldwide are affected by it. The rapid growth of social media, particularly video blogs, has drawn research attention toward analyzing user-generated audiovisual content for signs of depression. These videos capture natural facial expressions, voice characteristics, and speech patterns that may reveal more about a person's emotional state than verbal self-reports alone. However, extracting useful features from such noisy, unstructured data and combining audio and visual information in a way that preserves their complementary nature remain open problems in this domain.

The thesis makes two main contributions. In the first, we propose MDD-Net, a multimodal depression detection network that uses a mutual transformer to fuse acoustic and visual features. The acoustic branch employs a global self-attention network to process 25 low-level descriptors, including loudness, Mel-Frequency Cepstral Coefficients, and spectral flux, capturing both content-based and positional relationships. The visual branch applies hierarchical multi-head self-attention to 68 facial landmarks extracted from each video frame. The mutual transformer then operates bidirectionally: audio queries attend to visual keys and values, and visual queries attend to audio keys and values. We also design a composite loss function that combines binary cross-entropy, focal loss, and L2 regularization to handle the noisy labels and class imbalance common in social media datasets.
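The bidirectional exchange at the heart of the mutual transformer can be illustrated with a minimal single-head sketch. This is an illustration of scaled dot-product cross-attention in general, not MDD-Net's actual implementation; the dimensions, sequence lengths, and function names are made up for the example:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: queries from one modality
    attend to keys/values from the other modality."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)      # (T_q, T_k) similarity scores
    return softmax(scores, axis=-1) @ values    # (T_q, d) fused features

# Toy sequences: 4 audio frames and 6 video frames, 8-dim features each.
rng = np.random.default_rng(0)
audio = rng.standard_normal((4, 8))
video = rng.standard_normal((6, 8))

# Bidirectional fusion: each modality queries the other.
audio_enriched = cross_attention(audio, video, video)  # audio attends to video
video_enriched = cross_attention(video, audio, audio)  # video attends to audio
print(audio_enriched.shape, video_enriched.shape)  # (4, 8) (6, 8)
```

Note that each modality keeps its own sequence length after fusion; only the feature content is enriched with information from the other stream.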

In the second part, we introduce MMFformer, a multimodal fusion transformer network that takes a different approach to the same problem. For video, a pre-trained vision transformer augmented with residual connections extracts high-level spatial patterns from facial data. For audio, a transformer encoder built on the audio spectrogram transformer paradigm models temporal dynamics in speech signals through patch and positional embeddings. On the fusion side, we propose and compare three distinct strategies: late transformer fusion, intermediate transformer fusion, and intermediate attention fusion, each operating at a different level of the processing pipeline.
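The three strategies differ mainly in where the modalities meet in the pipeline. A schematic contrast between decision-level (late) and feature-level (intermediate) fusion is sketched below; the operators and dimensions are illustrative only, not the exact fusion modules used in MMFformer:

```python
import numpy as np

def late_fusion(video_logits, audio_logits, w=0.5):
    """Late fusion: each branch classifies independently and the
    per-branch predictions are combined at the decision level."""
    return w * video_logits + (1 - w) * audio_logits

def intermediate_fusion(video_feats, audio_feats):
    """Intermediate fusion: feature vectors are merged before a
    shared classifier head ever sees them."""
    return np.concatenate([video_feats, audio_feats], axis=-1)

video_feats = np.array([0.2, 0.9, 0.4])
audio_feats = np.array([0.7, 0.1])

fused = intermediate_fusion(video_feats, audio_feats)
print(fused.shape)  # (5,)
print(late_fusion(np.array([1.2]), np.array([0.4])))
```

Intermediate attention fusion replaces the plain concatenation above with an attention operator, letting one modality weight the other's features rather than simply stacking them.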

We evaluate both architectures on the D-Vlog dataset, a publicly available collection of 961 YouTube vlogs from 816 individuals annotated for depression. MMFformer is additionally tested on the LMVD dataset, a larger corpus of 1,823 vlogs collected from four different social media platforms. MDD-Net reaches an F1-Score of 77.07% on D-Vlog, improving on previously reported methods by 1.82% to 17.37%. MMFformer achieves 90.92% on D-Vlog and 90.48% on LMVD, surpassing the best existing results by 13.92% and 7.74%, respectively. Cross-corpus validation between D-Vlog and LMVD further confirms that the developed architectures generalize across different platforms and populations.
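For context, the F1-Score used throughout these comparisons is the harmonic mean of precision and recall. A minimal computation with made-up confusion counts (not taken from the thesis):

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 80 true positives, 15 false positives, 25 false negatives.
print(round(f1_score(80, 15, 25) * 100, 2))  # 80.0
```

Because it ignores true negatives, F1 is a common choice for depression screening, where the positive class is both rarer and clinically more important than the negative class.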