Automatic Speech Recognition Inference on the Nvidia Jetson.
The inference engine on this website lets you test real-time speech recognition on the Nvidia Jetson TX2 module for embedded AI computing at the edge. The model consists of a convolutional layer with 256 units, two bidirectional recurrent layers with 512 units each, and a time-distributed dense layer. The model was trained with Keras/TensorFlow on an Nvidia GTX 1070 GPU and deployed on an Apache web server on a Jetson TX2 using Flask in Python.
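As a rough illustration of the layer stack described above, here is a minimal Keras sketch. Several details are assumptions, not taken from the original model: the recurrent cells are GRUs, the input has 161 spectrogram bins per time step, the convolution uses a kernel size of 11, and the output covers 29 character classes (26 letters, space, apostrophe, and a blank symbol). The function name build_model is hypothetical.

```python
from tensorflow.keras import layers, models


def build_model(input_dim=161, output_dim=29):
    """Sketch of a conv + 2x bidirectional RNN + time-distributed dense stack.

    input_dim, output_dim, the kernel size, and the GRU cell type are
    illustrative assumptions, not values confirmed by the original project.
    """
    # Variable-length sequence of spectrogram frames: (time, frequency bins).
    inputs = layers.Input(shape=(None, input_dim), name="spectrogram")

    # One convolutional layer with 256 filters over the time axis.
    x = layers.Conv1D(256, kernel_size=11, padding="same", activation="relu")(inputs)

    # Two bidirectional recurrent layers with 512 units per direction.
    x = layers.Bidirectional(layers.GRU(512, return_sequences=True))(x)
    x = layers.Bidirectional(layers.GRU(512, return_sequences=True))(x)

    # Dense layer applied at every time step, producing character probabilities.
    outputs = layers.TimeDistributed(layers.Dense(output_dim, activation="softmax"))(x)
    return models.Model(inputs, outputs, name="asr_sketch")


if __name__ == "__main__":
    build_model().summary()
```

The per-step character probabilities produced by a model of this shape are typically paired with a sequence loss such as CTC during training; that part is omitted here.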
My goal was to build a character-level ASR system using a recurrent neural network in TensorFlow that can run inference on an Nvidia Jetson with a word error rate below 20%.
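For reference, word error rate is commonly computed as the word-level edit distance between the reference transcript and the predicted transcript, divided by the number of reference words. The sketch below shows that calculation in plain Python; the function name word_error_rate is hypothetical and this is not necessarily the scoring code used in this project.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.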
The primary dataset is the LibriSpeech ASR corpus, which includes 1000 hours of recorded speech. A 100-hour (6 GB) subset of the audio files was used for model development. The final model was trained on a 460-hour subset. The dataset consists of 16 kHz audio files, 10 to 15 seconds long, of spoken English derived from read audiobooks from the LibriVox project.
The training server contains an Intel 7700K overclocked to 4.8 GHz, 32 GB of RAM clocked at 2400 MHz, and an Nvidia GTX 1070 clocked to 1746 MHz (1920 Pascal cores). The inference server is a Jetson TX2 Developer Kit (256 Pascal cores).
Feature Extraction and Engineering
There are three primary methods for extracting features for speech recognition: raw audio waveforms, spectrograms, and MFCCs. For this project, I created a character-level sequence-to-sequence model using spectrograms. This allows me to train a model on a dataset with a limited vocabulary that generalizes better to unique or rare words. This comes at the cost of a model that is more computationally expensive, more difficult to interpret, and more susceptible to vanishing or exploding gradients, as the sequences can be quite long.
Raw Audio Waves (pictured above)
This method uses the raw waveform of each audio file, which is a 1D vector X = [x1, x2, x3, ...].
Spectrograms
This transforms the raw audio waveforms into a 2D tensor where the first dimension corresponds to time (the horizontal axis) and the second dimension corresponds to frequency (the vertical axis) rather than amplitude. We lose a little information in this conversion because we take the log of the power of the FFT, which discards phase. This can be written as log |FFT(X)|^2.
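The log |FFT(X)|^2 transform can be sketched with NumPy as follows. The framing parameters are assumptions chosen for illustration (20 ms Hann windows with a 10 ms step at 16 kHz), as is the small eps added to avoid taking the log of zero; the original project may have used different settings. The function name log_spectrogram is hypothetical.

```python
import numpy as np


def log_spectrogram(signal, sample_rate=16000, window_ms=20, step_ms=10, eps=1e-10):
    """Return a (time, frequency) array of log power values.

    Window length, step size, Hann windowing, and eps are illustrative
    assumptions, not settings confirmed by the original project.
    """
    signal = np.asarray(signal, dtype=np.float64)
    window = int(sample_rate * window_ms / 1000)  # samples per frame
    step = int(sample_rate * step_ms / 1000)      # hop between frames
    if len(signal) < window:
        raise ValueError("signal is shorter than one analysis window")

    # Slice the 1D waveform into overlapping frames: shape (n_frames, window).
    n_frames = 1 + (len(signal) - window) // step
    frames = np.stack([signal[i * step : i * step + window] for i in range(n_frames)])

    # Taper each frame, take the FFT, and keep the log of the power.
    frames = frames * np.hanning(window)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + eps)


if __name__ == "__main__":
    t = np.arange(16000) / 16000.0
    tone = np.sin(2 * np.pi * 500.0 * t)  # one second of a 500 Hz tone
    print(log_spectrogram(tone).shape)
```

With these settings each frequency bin spans 50 Hz, so one second of 16 kHz audio becomes 99 time steps of 161 bins each.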