Implementation of DNN-HMM Acoustic Models for Phoneme Recognition
Gaussian Mixture Model-Hidden Markov Models (GMM-HMMs) are the state of the art for acoustic modeling in speech recognition. HMMs model the sequential structure and temporal variability of speech signals, while GMMs model the local spectral variability of the sound wave at each HMM state. For many years, attempts to replace GMMs with Artificial Neural Networks (ANNs) in HMM-based acoustic models led to dismal results. ANNs could not significantly outperform GMMs because of their shallow architectures, and it was difficult to train networks with many hidden layers on large amounts of data using the back-propagation learning algorithm.
In recent years, with the establishment of the deep learning technique, ANNs with many hidden layers have been reintroduced as an alternative to GMMs in acoustic modeling and have shown successful results. The deep learning technique consists of a two-phase procedure. First, the ANN is generatively pre-trained using an unsupervised learning algorithm. Then, it is discriminatively fine-tuned using the back-propagation learning algorithm. The generative pre-training is intended to initialize the weights of the network for better generalization performance during the discriminative phase. Combining Deep Neural Networks (DNNs) and HMMs within a single hybrid architecture for acoustic modeling has shown promising results in many speech recognition tasks.
This thesis aims to empirically confirm the capability of DNNs to outperform GMMs in acoustic modeling. It also provides a systematic procedure for implementing DNN-HMM acoustic models for phoneme recognition, including the implementation of a GMM-HMM baseline system. The thesis starts with a thorough overview of the fundamentals and background of speech recognition. It then discusses the DNN architecture and its learning technique, as well as the problems of GMMs and the advantages of DNNs in acoustic modeling. Finally, DNN-HMM hybrid acoustic models for phoneme recognition are implemented. The deployed DNN is generatively pre-trained and fine-tuned to produce a posterior distribution over the states of monophone HMMs. The developed DNN-HMM phoneme recognition system outperformed the GMM-HMM baseline on the TIMIT core test set. An in-depth investigation into the major factors behind the success of DNNs is also carried out.
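
As an illustration of how a posterior distribution over HMM states can be used inside an HMM decoder, the following minimal sketch assumes the common hybrid convention of dividing each state posterior by the state prior to obtain a scaled likelihood. The abstract does not state that this conversion is the one used in the thesis, so it is an assumption here. The function names, the three-state setup, and all numeric values are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw output-layer activations into a posterior over HMM states."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_log_likelihoods(posteriors, priors):
    """Return log P(s|x) - log P(s) for each state s.

    By Bayes' rule this equals log p(x|s) up to a term that does not
    depend on the state, which is enough for comparing HMM paths.
    """
    return [math.log(p) - math.log(q) for p, q in zip(posteriors, priors)]


# Hypothetical values for one acoustic frame and three HMM states.
logits = [2.0, 0.5, -1.0]
priors = [0.5, 0.3, 0.2]  # assumed state priors, e.g. relative state counts

posteriors = softmax(logits)
scores = scaled_log_likelihoods(posteriors, priors)
```

In such a setup, `scores` would take the place of the per-state GMM log-likelihoods during decoding, while the HMM transition structure stays unchanged.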