End-to-End Speech Synthesis for Indian Languages: A Multilingual Perspective

Name of the Speaker: Anusha Prakash (EE17D039)
Guide: Dr. S. Umesh
Co-Guide: Dr. Hema A. Murthy

Venue: SSB 233 (MR1), 1st floor, CSE
Date/Time: 20th December 2022 (Tuesday), 2.00 PM

Speech is one of the most widely used forms of communication. A text-to-speech (TTS) synthesiser is an important speech technology that generates speech corresponding to a given text. Traditional approaches to building a TTS system, such as unit selection synthesis (USS) and hidden Markov model (HMM) based synthesis, rely on language-specific modules and accurately segmented boundaries at the sub-word level. Hand-crafting these modules makes system building difficult and time-consuming. With the advent of neural network-based end-to-end (E2E) approaches, training TTS systems has become easier when a large amount of data is available for a language. Systems can be trained quickly using accurate text-audio pairs aligned at the sentence level. However, building E2E speech synthesisers for Indian languages is challenging, given the lack of adequate clean training data and the different grapheme representations (scripts) used across languages.

In this talk, I will present two main contributions of our work: (1) a multi-language character map (MLCM) to handle the issue of different scripts across languages, and (2) a language-family-based perspective on system building. The objective of the second is to exploit the phonotactic properties shared within language families, so that small amounts of accurately transcribed data across languages can be pooled together to train TTS systems. Experiments in low-resource and zero-shot scenarios highlight the efficacy of the proposed approaches.
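
To give a rough feel for what a shared character map across scripts could look like, here is a minimal Python sketch. It is not the MLCM presented in the talk: it assumes only that the major Indic Unicode blocks follow a parallel, ISCII-derived layout, so a character's offset within its block can serve as an approximate common label. The names INDIC_BLOCK_STARTS, to_common_label and map_text are hypothetical, and a real map would need hand-curated handling of language-specific letters and pronunciation differences.

```python
# Illustrative sketch only; NOT the MLCM described in the talk.
# Assumption: the Indic Unicode blocks below are 128 code points each and
# share a parallel layout, so the offset within a block is an approximate
# cross-script character identity.

INDIC_BLOCK_STARTS = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def to_common_label(ch: str) -> str:
    """Map one character to a script-independent label (hypothetical scheme)."""
    cp = ord(ch)
    for start in INDIC_BLOCK_STARTS.values():
        if start <= cp < start + BLOCK_SIZE:
            # Same offset in any covered block yields the same label.
            return f"C{cp - start:02X}"
    # Characters outside the covered blocks (spaces, punctuation) pass through.
    return ch


def map_text(text: str) -> list[str]:
    """Convert a string into a sequence of common labels."""
    return [to_common_label(ch) for ch in text]


if __name__ == "__main__":
    hindi = "\u0915\u092e\u0932"  # Devanagari KA, MA, LA
    tamil = "\u0b95\u0bae\u0bb2"  # Tamil KA, MA, LA
    print(map_text(hindi))
    print(map_text(tamil))
    print(map_text(hindi) == map_text(tamil))
```

With such a mapping, text from different scripts is reduced to one token inventory, which is the precondition for pooling data across languages into a single training set.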