Please note: This master’s thesis presentation will take place online.
Karthik Ramesh, Master’s candidate
David R. Cheriton School of Computer Science
Supervisor: Professor N. Asokan
Authentication mechanisms have long been prevalent in our society, dating back as far as Ancient Mesopotamia in the form of seals. Since the advent of the digital age, the need for good digital authentication techniques has soared, driven by the widespread adoption of online platforms and digitized content.
Audio-based authentication, such as speaker verification, has been explored as one mechanism for achieving this goal. Specifically, an audio template belonging to the authorized user is stored with the authentication system. This template is later compared against the incoming voice sample to authenticate the user.
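To make the template-matching step concrete, the following minimal sketch compares an enrolled template against a test utterance's embedding using cosine similarity. The embedding dimensionality and the acceptance threshold are illustrative assumptions, not values taken from the thesis.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(enrolled_template: np.ndarray, test_embedding: np.ndarray,
           threshold: float = 0.7) -> bool:
    """Accept the speaker if the test embedding is close enough to the enrolled template."""
    return cosine_similarity(enrolled_template, test_embedding) >= threshold

# Example usage with random stand-ins for real speaker embeddings.
rng = np.random.default_rng(0)
enrolled = rng.normal(size=192)                    # template stored at enrollment time
genuine = enrolled + 0.1 * rng.normal(size=192)    # same speaker, slight variation
impostor = rng.normal(size=192)                    # different speaker

print(verify(enrolled, genuine))    # likely True
print(verify(enrolled, impostor))   # likely False
```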
Audio spoofing refers to attacks that fool the authentication system into granting access to restricted resources, and such attacks have been shown to effectively degrade the performance of a variety of audio authentication methods. In response, spoofing countermeasures have been developed that can detect and successfully thwart these attacks.
The adoption of deep learning in real-life applications has spurred research into techniques whose purposes range from exploiting weaknesses in the deep learning model to stealing confidential information. One way a deep learning-based audio authentication model can be evaded is through a class of attacks known as adversarial attacks, which add a carefully crafted perturbation to the input to elicit an incorrect inference from the model.
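As an illustration of such a perturbation, the sketch below shows the Fast Gradient Sign Method (FGSM), one well-known adversarial attack; the model, loss function, and perturbation budget epsilon are placeholders, not the specific attacks evaluated in the thesis.

```python
import torch

def fgsm_perturb(model, x, y_true, loss_fn, epsilon=0.002):
    """One FGSM step: perturb the input in the direction that increases the loss,
    bounded by epsilon, so the model is nudged toward a wrong inference."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y_true)
    loss.backward()
    return (x_adv + epsilon * x_adv.grad.sign()).detach()
```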
We first explore the performance benefits that multimodality brings to the anti-spoofing task. Since visuals serve as an additional domain of information, we augment an existing unimodal spoofing countermeasure with visual information to determine whether this new information improves performance. Our results indicate that it does not provide any performance benefit. Future work can explore more tightly coupled multimodal models that use objectives such as contrastive loss.
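For reference, a contrastive objective of the kind mentioned above could look like the following InfoNCE-style sketch, where paired audio and visual embeddings of the same sample are pulled together and mismatched pairs pushed apart; the function name and temperature are illustrative assumptions, not the thesis's implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb: torch.Tensor, visual_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss over a batch of paired (audio, visual) embeddings
    of shape (batch, dim): the i-th audio embedding should match the i-th visual one."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature        # pairwise similarity matrix
    targets = torch.arange(a.size(0))       # matching pairs lie on the diagonal
    return F.cross_entropy(logits, targets)
```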
We then study the vulnerability of deep learning-based multimodal speaker verification to adversarial attacks. This vulnerability has not previously been established for multimodal speaker verification, and we aim to do so. We find that the multimodal models rely heavily on the visual modality and that attacking both modalities leads to a higher attack success rate. Future work can move on to stronger attacks that apply adversarial perturbations to bypass both the spoofing countermeasure and speaker verification.
Finally, we investigate the feasibility of a generic evasion detector that can block both adversarial and spoofing attacks. Since both spoofing and adversarial attacks target speaker verification models, we add an adversarial attack detection mechanism, feature squeezing, on top of the spoofing countermeasure to achieve this. We find that such a detector is feasible but comes at the cost of a significant reduction in the acceptance of genuine samples. Future work can explore combining this with adversarial training as a defense against attacks that target the complete spoofing countermeasure and speaker verification pipeline.
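A minimal sketch of the feature-squeezing idea, assuming a generic model_predict scoring function and an illustrative detection threshold (neither taken from the thesis): the input is squeezed, here via bit-depth reduction, and flagged as adversarial if the model's output shifts too much between the original and squeezed versions.

```python
import numpy as np

def reduce_bit_depth(x: np.ndarray, bits: int = 8) -> np.ndarray:
    """Squeeze the input by quantizing it to a coarser set of levels."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

def is_adversarial(model_predict, x: np.ndarray, threshold: float = 0.5) -> bool:
    """Flag an input as adversarial if the model's output on the squeezed input
    differs from its output on the original input by more than the threshold."""
    p_original = model_predict(x)
    p_squeezed = model_predict(reduce_bit_depth(x))
    return np.abs(p_original - p_squeezed).max() > threshold
```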