A hybrid CNN-LSTM model for speaker independent command word recognition

Ebun Phillip Fasina; Babatunde Alade Sawyerr; Chibuzor Nwalor; Ogban Ugot

Ebun Phillip Fasina, Department of Computer Sciences, University of Lagos, Akoka, Nigeria
Babatunde Alade Sawyerr, Department of Computer Sciences, University of Lagos, Akoka, Nigeria
Chibuzor Nwalor, Department of Computer Sciences, University of Lagos, Akoka, Nigeria
Ogban Ugot, Department of Computer Sciences, University of Lagos, Akoka, Nigeria

Keywords: Command Word Recognition, Convolutional Neural Network, Long Short-Term Memory, Deep Learning, Recurrent Neural Network, Natural Language Processing

Abstract

Automatic speech keyword recognition is an important subset of general speech recognition. It is especially relevant where computational resources are limited, such as voice command recognition on low-power, low-memory devices and in robot interaction. This paper introduces a method for efficient speaker-independent, real-time command word recognition using a hybrid Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) network with only 9.8K trainable parameters. The CNN extracts short-term spatial features from the Mel Frequency Cepstral Coefficients (MFCCs) of command words, arranged into an image-like format. The LSTM then models long-term dependencies across the extracted features. The model is trained and evaluated on the Google Speech Commands dataset, on which it achieves an accuracy of 83%, a memory requirement that is 2-5% of that of state-of-the-art models, and a faster response time than off-the-shelf models.
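
To give a sense of how a hybrid CNN-LSTM can stay within a budget of a few thousand trainable parameters, the following sketch counts the parameters of a small convolution, LSTM and dense stack applied to an image-like array of Mel Frequency Cepstral Coefficients. It is an illustration only: the layer sizes, the 25 ms / 10 ms framing, the 13 coefficients and the 12 output classes are assumptions chosen for this example and are not the configuration described in this paper, so the total does not reproduce the reported 9.8K figure.

```python
# Hypothetical parameter budget for a tiny CNN-LSTM keyword spotter.
# Every size below is an assumption made for illustration only.


def mfcc_frames(num_samples, frame_len, hop):
    """Number of full analysis frames in a clip (no padding)."""
    return 1 + (num_samples - frame_len) // hop


def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * filters


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, outputs):
    """Fully connected layer: weights plus one bias per output."""
    return (input_dim + 1) * outputs


# One-second clip sampled at 16 kHz, 25 ms window, 10 ms hop (assumed framing).
frames = mfcc_frames(num_samples=16000, frame_len=400, hop=160)
n_mfcc = 13                      # assumed number of coefficients per frame

# Assumed CNN stage: one 3x3 convolution with 8 filters ('same' padding),
# followed by 2x max pooling along the coefficient axis only.
filters = 8
conv = conv2d_params(3, 3, in_channels=1, filters=filters)
pooled_coeffs = n_mfcc // 2
features_per_step = pooled_coeffs * filters   # flattened per time step

# Assumed LSTM stage reading one feature vector per frame.
units = 24
lstm = lstm_params(features_per_step, units)

# Assumed classifier over 12 command classes.
dense = dense_params(units, outputs=12)

total = conv + lstm + dense
print(f"input image: {frames} x {n_mfcc}")
print(f"conv={conv}, lstm={lstm}, dense={dense}, total={total}")
```

In this hypothetical configuration the LSTM accounts for most of the budget, because its parameter count grows with both the per-frame feature width and the square of the number of units; narrowing the convolutional output is therefore the most direct way to shrink such a model.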