Jaejun Lee, Master’s candidate
David R. Cheriton School of Computer Science
Used for simple voice commands and wake-word detection, keyword spotting (KWS) is the task of detecting pre-determined keywords in a stream of utterances. A common implementation of KWS transmits audio samples over the network and detects target keywords in the cloud with neural networks, because on-device application development presents compatibility issues across edge devices and offers limited support for deep learning. Unfortunately, such an architecture can lead to an unpleasant user experience because network latency is not deterministic. Furthermore, the client-server architecture raises privacy concerns, since users lose control over their audio data once it leaves the edge device.
A comprehensive efficiency evaluation on desktops, laptops, and mobile devices shows that Honkling, an in-browser keyword spotting system, detects keywords in only 0.5 seconds while achieving a high accuracy of 94% on the Google Speech Commands dataset. An empirical study, however, finds that the accuracy of Honkling is inconsistent in practice due to differences in user accents. To ensure high detection accuracy for every user, I explore fine-tuning the trained model with personalized user recordings. Comprehensive experiments show that this process can increase absolute accuracy by up to 10% with only five recordings per keyword. Furthermore, the study shows that in-browser fine-tuning takes only eight seconds in the presence of hardware acceleration.
200 University Avenue West
Waterloo, ON N2L 3G1