TITLE : Addressing data sparsity in acoustic modeling for Automatic Speech Recognition
DATE : 24-01-2018
TIME : 3:00pm – 4:00pm
VENUE : ESB 244
SPEAKER : Tejaswi Seeram (EE15S044)
GUIDE : Dr. S Umesh
GTC Members :
Dr. Srikrishna B (Chairperson)
Dr. Chandrasekhar C (M) (CS)
Dr. Kaushik Mitra (M)
The application of deep neural networks (DNNs) to speech recognition has become fairly ubiquitous in the recent past. An important factor behind their success is the availability of thousands of hours of data, which makes it possible to build complex models with millions of parameters without over-fitting. However, many languages have only a limited amount of properly transcribed data, and acquiring more is costly in terms of man-hours. Developing acoustic models for such low-resource languages is considered one of the challenging problems in automatic speech recognition (ASR). The goal of the current work is to build robust acoustic models for low-resource languages, with the help of other high-resource languages, so that their performance is comparable to that of a high-resource scenario.
We propose techniques which include: (i) training a DNN in a multilingual (block-softmax) framework with an additional KL-divergence-based regularization constraint on the parameter deviation space. This keeps the low-resource posterior distribution from deviating too much from its true distribution; (ii) an architecture to train a DNN-based acoustic model for low-resource data using a distillation framework. An intelligent teacher (a multilingual DNN) provides easier targets to the low-resource DNN (the student). The student's task is then to output a multinomial distribution, which is a more ‘achievable’ scenario than a pure classification task; (iii) in a multilingual training scenario, the initial layers learn a language-independent representation, while the final layers are responsible for the actual classification of phonemes. When these networks are used as feature extractors, with the final layer(s) omitted, they yield language-independent feature representations that carry information about phonetic context. DNNs trained on these ‘multilingual features’, with additional supervision from a teacher (which may not always be better than the student), help in building a robust system for low-resource data sets.
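As a rough illustration of the distillation idea in (ii), the following minimal sketch computes a per-frame loss in which the student is pulled towards the teacher's posterior distribution through a KL-divergence term, combined with the usual cross-entropy on the hard label. It is not the speaker's implementation: the function names, the interpolation weight `alpha` and the softmax `temperature` are assumptions made purely for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to a posterior distribution (temperature is an assumed knob)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, hard_label,
                      alpha=0.5, temperature=2.0):
    """Hypothetical per-frame loss: alpha weights the teacher (soft) term,
    (1 - alpha) weights the cross-entropy against the transcribed label."""
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_term = kl_divergence(teacher_soft, student_soft)

    student_post = softmax(student_logits)
    hard_term = -math.log(student_post[hard_label] + 1e-12)

    return alpha * soft_term + (1.0 - alpha) * hard_term


# Toy frame with three phone classes; the teacher is the multilingual DNN.
loss = distillation_loss([1.0, 0.5, -0.2], [2.0, 0.1, -1.0], hard_label=0)
```

With `alpha = 1.0` the student is trained on the teacher's soft targets alone, which corresponds to the ‘output a multinomial distribution’ view described above; smaller values retain some direct supervision from the transcriptions.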