RTKWS: Real-Time Keyword Spotting Based on Integer Arithmetic for Edge Deployment

Prakash Dhungana and Sayed Salehi
University of Kentucky


Abstract

This paper presents an efficient real-time keyword spotting (KWS) architecture for edge devices. The proposed architecture comprises data acquisition, silence detection, feature extraction (FE), and binary classification units. To minimize the memory footprint and computational complexity, the architecture operates on 8-bit integer voice data and performs all computations in integer arithmetic. The FE unit converts the input data into 2-dimensional feature maps using a short-time Fourier transform (STFT); these maps are then passed to the classification unit. The classification unit is powered by a neural network model comprising three convolutional layers and one fully connected layer. The model is quantized using a new approach based on the quantization method in TensorFlow Lite. The model can be trained to accurately classify the feature maps for any pair of desired keywords. We implemented the architecture in pure C code with no external dependencies to make it portable to general edge devices. We deployed the architecture on a low-cost edge device, the TM4C123GXL, and the results show an average accuracy of 90.25% for different keyword pairs from the Google Speech Commands Dataset (GSCD) v1, with a total memory requirement of 9.711 KB of RAM and 13.598 KB of Flash.
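
To illustrate what integer-only inference can look like in plain C, the following minimal sketch computes one quantized dot product, the core operation of a convolutional or fully connected layer. It is not the quantization approach proposed in this paper. It only follows the general fixed-point idea used by integer kernels in the style of TensorFlow Lite, where a 32-bit accumulator is rescaled by an integer multiplier and a right shift instead of a floating-point scale. All function and parameter names (requantize, quantized_dot, x_zero_point, and so on) are hypothetical, and the sketch assumes shift is between 1 and 62.

```c
#include <stdint.h>

/* Clamp a 32-bit value to the int8 range [-128, 127]. */
static int8_t saturate_int8(int32_t v)
{
    if (v > 127) return 127;
    if (v < -128) return -128;
    return (int8_t)v;
}

/*
 * Approximate acc * (multiplier / 2^shift) using only integers.
 * Rounds to nearest, with ties away from zero. Negative values are
 * handled through their magnitude so that no negative number is
 * ever right-shifted. Assumption: 1 <= shift <= 62.
 */
static int32_t requantize(int32_t acc, int32_t multiplier, int shift)
{
    int64_t prod = (int64_t)acc * (int64_t)multiplier;
    int64_t rounding = (int64_t)1 << (shift - 1);

    if (prod >= 0)
        return (int32_t)((prod + rounding) >> shift);
    return (int32_t)(-((-prod + rounding) >> shift));
}

/*
 * Quantized dot product of n int8 inputs and int8 weights.
 * The input zero point is subtracted before multiplication, the
 * products are accumulated in 32 bits, and the result is rescaled
 * and saturated back to int8.
 */
int8_t quantized_dot(const int8_t *x, const int8_t *w, int n,
                     int32_t bias, int32_t x_zero_point,
                     int32_t multiplier, int shift,
                     int32_t out_zero_point)
{
    int32_t acc = bias;
    for (int i = 0; i < n; i++)
        acc += ((int32_t)x[i] - x_zero_point) * (int32_t)w[i];

    return saturate_int8(requantize(acc, multiplier, shift) + out_zero_point);
}
```

In this sketch, a real-valued scale of 0.5 would be encoded as multiplier = 1 << 30 with shift = 31. The pair would be computed offline from the floating-point scales of the trained model, so that no floating-point operation is needed on the device.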