PhD Viva


Name of the Speaker: Mr. Prithviraj Pani (EE16D202)
Guide: Dr. Nitin Chandrachoodan
Co-Guide: Dr. Janakiraman Viraraghavan
Online meeting link: https://meet.google.com/ugy-ueep-vhn
Date/Time: 15th April 2025 (Tuesday), 10:00 AM
Title: Memory and Compute Efficient Real-Time Automatic Speech Recognition for Embedded Hardware Platform

Abstract:

Speech recognition is an increasingly common requirement of modern technology. While cloud-based Automatic Speech Recognition (ASR) systems have long been available, security and network-latency constraints necessitate implementing ASR on edge devices. There, Viterbi decoding is the main computational and memory bottleneck: it searches a large state space to find the most probable sequence of phonemes that produced a given sound. Deploying ASR on resource-constrained platforms is therefore challenging due to low CPU speed and limited memory. Most prior works use hash tables and beam-width pruning to restrict the Active State List (ASL), which requires large memory, numerous acoustic probability computations, and repeated DRAM accesses to achieve acceptable accuracy and performance. A minimal sketch of this conventional approach follows.
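The Python sketch below illustrates one Viterbi time step with beam-width pruning, assuming hypothetical names and a hypothetical acoustic_cost interface (this is not the thesis implementation):

def viterbi_step(active_states, transitions, acoustic_cost, beam=12.0):
    """One frame of Viterbi decoding with beam-width pruning.
    active_states: dict state_id -> accumulated path cost (the ASL)
    transitions:   dict state_id -> list of (next_state, transition_cost)
    acoustic_cost: function(state_id) -> acoustic negative log-likelihood"""
    expanded = {}
    for state, cost in active_states.items():
        for nxt, t_cost in transitions.get(state, []):
            # one acoustic probability computation per expanded arc
            new_cost = cost + t_cost + acoustic_cost(nxt)
            if nxt not in expanded or new_cost < expanded[nxt]:
                expanded[nxt] = new_cost
    if not expanded:
        return {}
    best = min(expanded.values())
    # beam pruning: keep every state within `beam` of the best cost;
    # the survivor count is data-dependent and can stay very large
    return {s: c for s, c in expanded.items() if c <= best + beam}

Because the number of survivors depends on the data rather than a fixed budget, the ASL can grow into the tens of thousands of states, which is what drives the memory and DRAM traffic described above.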

This thesis proposes using a binary search tree (BST) and a max-heap (MH) data structure to track the worst-cost state and efficiently replace it when a better state is found. With this approach, the ASL size can be reduced from over 32K to 512 with minimal impact on recognition accuracy. Combined with a caching technique for acoustic scores, this reduces the data accessed from DRAM by 31× and the number of acoustic probability computations by 26×. With these optimizations, real-time ASR has been implemented on a Xilinx Zynq FPGA using 91% fewer block RAMs.
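A minimal sketch of the max-heap side of this idea, assuming a hypothetical BoundedASL class (the thesis pairs the heap with a BST, which is omitted here, as is deduplication of repeated state IDs):

import heapq

class BoundedASL:
    """Fixed-capacity Active State List. Costs are negated so Python's
    min-heap keeps the worst-cost (highest-cost) state at the root,
    where it can be evicted in O(log n) when a better state arrives."""

    def __init__(self, capacity=512):
        self.capacity = capacity
        self.heap = []  # entries: (-cost, state_id)

    def offer(self, state_id, cost):
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, (-cost, state_id))
        elif cost < -self.heap[0][0]:
            # new state beats the current worst: pop-and-push in one operation
            heapq.heapreplace(self.heap, (-cost, state_id))
        # otherwise the candidate is worse than everything retained: drop it

Unlike beam-width pruning, the capacity here is a hard bound, so memory usage is fixed regardless of the input utterance.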

Despite the significant performance improvement, this optimization does not carry over directly to state-of-the-art Transformer-based ASR, which uses relatively small ASLs to begin with. Such models nevertheless have high computational complexity and memory requirements. A close analysis of the operations involved in Transformer ASR's Viterbi decoding shows that the beam search algorithm and the sequential decoder operations are the main computational and memory bottlenecks (see the sketch below), making real-time, large-vocabulary ASR with such models on embedded platforms even harder to achieve.
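To make the bottleneck concrete, here is a minimal beam-search sketch assuming a hypothetical decoder(tokens, memory) call that returns per-token log-probabilities; none of these names come from the thesis:

def beam_search(decoder, memory, bos_id, eos_id, beam_size=10, max_len=200):
    """Each hypothesis is (score, token list). Every time step runs the
    decoder once per live hypothesis, so compute and memory both scale
    with beam_size, and the steps themselves are strictly sequential."""
    beams = [(0.0, [bos_id])]
    for _ in range(max_len):
        candidates = []
        for score, toks in beams:
            if toks[-1] == eos_id:             # finished hypotheses carry over
                candidates.append((score, toks))
                continue
            log_probs = decoder(toks, memory)  # sequential decoder call
            for tok, lp in enumerate(log_probs):
                candidates.append((score + lp, toks + [tok]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
        if all(toks[-1] == eos_id for _, toks in beams):
            break
    return beams[0][1]  # best-scoring hypothesis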

In this work, we replace beam search with greedy search. While greedy decoding has traditionally exhibited poor accuracy in conventional ASR systems, our experiments with Transformer ASR demonstrate that it can achieve accuracy comparable to beam search. This is leveraged to achieve faster-than-real-time Transformer-based speech decoding on an embedded platform with minimal loss in accuracy. On a large custom dataset, computations are reduced by 15× and memory storage by 40× with only a 0.6% penalty on the error rate. The number of sequential operations is also reduced proportionately. Together, these speed up Transformer ASR from decoding 7× slower than real time to 2.6× faster than real time on a Raspberry Pi 4B.
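Under the same hypothetical decoder interface as the beam-search sketch above, greedy decoding collapses each step to a single decoder call:

def greedy_decode(decoder, memory, bos_id, eos_id, max_len=200):
    """Greedy autoregressive decoding: keep only the single best token at
    each step, so every step costs exactly one decoder call and only one
    partial hypothesis is ever stored."""
    tokens = [bos_id]
    for _ in range(max_len):
        log_probs = decoder(tokens, memory)   # one sequential call per step
        next_tok = max(range(len(log_probs)), key=lambda t: log_probs[t])
        tokens.append(next_tok)
        if next_tok == eos_id:
            break
    return tokens

Relative to beam search, both the per-step compute and the stored hypothesis state shrink by roughly the beam width, which is the source of the reductions reported above.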