Title: Subspace Based Features in Speech Recognition using Neural Networks
Speaker: Murali Karthick B
This thesis investigates better input features for training neural network acoustic models. In particular, the phonetic discrimination and speaker normalization aspects of features are analyzed in detail. The thesis proposes an alternative type of discriminative feature, called "frame-specific vectors" (FSV), derived using subspace Gaussian mixture models (SGMM). In an SGMM, the model parameters are derived from low-dimensional model and speaker subspaces that capture phonetic and speaker correlations. In this work, we use FSV both as a standalone input and in tandem with Mel frequency cepstral coefficients (MFCC), and we show improved recognition performance with deep neural network (DNN) acoustic models.

For speaker normalization in DNNs, convolutional neural network (CNN) layers have been shown to act as better extractors of speaker- and channel-invariant features. State-of-the-art CNN models use Mel filterbank features to model local correlations. Attempts to improve CNN performance by applying the feature space maximum likelihood linear regression (fMLLR) transform to Mel filterbank features resulted in performance degradation. To overcome this problem, we estimate full-covariance Gaussian based fMLLR transforms using SGMM. This method is computationally efficient and requires fewer parameters. Secondly, we also augment the acoustic features with SGMM speaker vectors to provide speaker information to CNN models. Experiments show that SGMM-based fMLLR features and speaker vectors yield better relative improvements than conventional features such as Mel filterbank features and i-vectors.
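As a rough illustration of the subspace structure mentioned above, the sketch below computes SGMM state-dependent means and mixture weights from a low-dimensional phonetic state vector and a speaker vector. The dimensions and random parameter values are purely illustrative assumptions, not the thesis's actual configuration; the equations follow the standard SGMM formulation (means as subspace projections, weights via a log-linear model).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the thesis):
D = 39  # acoustic feature dimension
S = 40  # phonetic subspace dimension
T = 39  # speaker subspace dimension
I = 4   # number of shared Gaussians

# Globally shared SGMM parameters (random stand-ins):
M = rng.standard_normal((I, D, S))  # phonetic subspace projections, one per Gaussian
N = rng.standard_normal((I, D, T))  # speaker subspace projections
w = rng.standard_normal((I, S))     # weight projection vectors

v_j = rng.standard_normal(S)  # state-specific phonetic vector for state j
v_s = rng.standard_normal(T)  # speaker vector for the current speaker

# State- and speaker-adapted means: mu_ji = M_i v_j + N_i v_s
means = M @ v_j + N @ v_s  # shape (I, D)

# Mixture weights via a log-linear model: w_ji proportional to exp(w_i . v_j)
logits = w @ v_j
weights = np.exp(logits - logits.max())
weights /= weights.sum()
```

The key point the sketch conveys is that per-state parameters are not stored directly: each state only holds a small vector `v_j`, and the full covariance structure, means, and weights are reconstructed through the shared subspaces, which is what keeps the parameter count low.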