Muhammad Babar Kamal, Arfat Ahmad Khan, Faizan Ahmed Khan, Malik Muhammad Ali Shahid, Chitapong Wechtaisong*, Muhammad Daud Kamal, Muhammad Junaid Ali and Peerapong Uthansakul
1COMSATS University Islamabad, Islamabad Campus, 45550, Pakistan
2Suranaree University of Technology, Nakhon Ratchasima, 30000, Thailand
3COMSATS University Islamabad, Lahore Campus, 54000, Pakistan
4COMSATS University Islamabad, Vehari Campus, 61100, Pakistan
5National University of Sciences & Technology, Islamabad, 45550, Pakistan
6Virtual University of Pakistan, Islamabad Campus, 45550, Pakistan
Abstract: Advances in deep learning have greatly improved the performance of speech recognition systems, and most recent systems are based on the Recurrent Neural Network (RNN). The RNN works well on short sequences but suffers from the vanishing gradient problem on long sequences. Transformer networks mitigate this issue and have shown state-of-the-art results on sequential and speech-related data. In speech recognition, the input audio is generally converted into an image using a Mel-spectrogram to illustrate frequencies and intensities. The image is then classified by a machine learning model to generate a classification transcript. However, the audio frequencies in the image have low resolution, which causes inaccurate predictions. This paper presents a novel end-to-end binary-view transformer-based architecture for speech recognition to cope with the frequency resolution problem. Firstly, the input audio signal is transformed into a 2D image using a Mel-spectrogram. Secondly, the modified universal transformer uses multi-head attention to derive contextual information and different speech-related features. Moreover, a feed-forward neural network is deployed for classification. The proposed system achieves robust results on Google's speech command dataset, with an accuracy of 95.16% and minimal loss. The binary-view transformer reduces the risk of over-fitting by deploying a multi-view mechanism to diversify the input data, and multi-head attention captures multiple contexts from the data's feature map.
Keywords: Convolutional neural network; multi-head attention; multi-view; RNN; self-attention; speech recognition; transformer
The recent surge of Artificial Intelligence (AI) in modern technology has resulted in the widespread adoption of Human-Computer Interaction (HCI) applications. Large information technology corporations such as Google, Apple, Microsoft, and Amazon are working relentlessly to improve the applicability and dynamics of HCI applications using speech recognition algorithms. Recognition systems matter across many fields, with stakeholders ranging from entertainment and utility applications to critical lifesaving appliances. For example, YouTube [1] and Facebook [2] use speech recognition systems for captioning. Various robust voice command applications have been proposed for devices that work in areas without internet service and for mission-critical robots [3,4]. Moreover, robust designs for micro-controller-based devices that operate on speech commands have also been proposed in the literature [5]. Apple Siri, Amazon Alexa, Microsoft Cortana, YouTube captions, and Google Assistant [6] deploy speech recognition systems based on these designs. Google and Microsoft [7] use deep neural network-based algorithms that convert sound to text through speech recognition, process the text, and respond accordingly. Typically, deep learning algorithms process 1D data, since audio is recorded and represented as a 1D waveform [8]. The 1D waveform is represented in the time domain as a sinusoidal signal. In [9], the authors studied the 2D representation of an audio signal called the spectrogram, in which the frequency spectrum is derived in the time-frequency domain through the Fourier transform.
Speech signals contain rich, prominent features such as emotion and dialect. Studies comparing the 1D audio waveform with the 2D spectrogram concluded that the 1D signal does not contain the frequency information vital for better speech recognition [10]. Studies show that the 2D spectrogram performs better for extracting features for speech recognition. However, since a spectrogram covers all frequencies, the recognition system cannot properly differentiate between relevant frequencies and noise [11]. Fusing the mel-scale with the spectrogram reduces noise, which improves speech recognition performance. The mel-scale discards noise and amplifies the desired spectrum of frequencies in the 2D spectrogram. The 2D transformation (mel-spectrogram) of the audio signal makes it possible to deploy state-of-the-art image recognition algorithms from Neural Networks (NNs) for speech recognition, improving the precision of the system by imitating human speech perception [12]. NN algorithms [13] process raw input data by correlating hidden patterns to recognize similar clusters in the data and classify it, continuously learning and enhancing the recognition system. Recurrent NNs (RNNs) [14-16], Convolutional NNs (CNNs) [17], and attention are commonly used to develop speech recognition systems. An RNN models sequential data using recurrence units to predict the next likely pattern. RNN algorithms and their variants, i.e., Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), allow the machine to process sequential data, such as speech. LSTM has become popular among recurrent networks due to its success in mitigating the vanishing gradient problem by retaining the long-term dependencies of data.
However, the LSTM [18] fails to solve the vanishing gradient problem completely due to the complexity of the additional evaluation of memory cells. RNN models are prone to over-fitting because dropout is difficult to apply to LSTM probabilistic units. The sequential nature of these models is incompatible with parallel processing [19]. RNN models require more resources and time to train due to the linearized nature of their layers and random weight initialization. Many researchers have used CNNs for audio classification, analyzing visual imagery of audio by convolving multiple filters over the data to extract features for the neural network. A deep CNN convolves multiple layers of filters over the image to extract distinct features, with depth depending on the number of layers. Deep networks improve the algorithm's ability by capturing unique properties through multiple convolution layers to retrieve higher-level features. The feature map produced by this process enhances the accuracy of the recognition system.
However, these studies observe that the deeper convolution layers tend to assimilate general, abstract-level information from the input data [20]. A deep CNN model tends to over-fit when little labeled training data is available. Deep convolutional networks are also prone to gradient vanishing/exploding problems as the network deepens, which reduces the precision of the recognition model. Therefore, researchers deploy attention mechanisms with an RNN model to obtain long-term dependencies by contextualizing the feature map. The attention model uses probabilistic learning, giving weight to the important features using the soft-max probabilistic function. Moreover, attention-based models reduce the vanishing gradient problem by decreasing the number of features, so that only important and unique features are processed by the recognition system [21]. In [22], the authors introduce a variation of the attention mechanism, self-attention, which computes a representation of a sequence by relating its different positions. Self-attention allows the elements of an input sequence to interact with all neighboring values and find contextual and positional attention within the same sequence. In [23], the authors apply the multi-view approach with a neural network algorithm to increase the efficiency of the architecture. The main objective of this paper is to improve existing speech recognition systems by building a precise method with a lightweight footprint that can be implemented in any speech recognition application.
In [24], the authors use the Fourier transform to convert the waveform signal into an alternative representation characterized by sinusoidal functions. That paper uses infrared spectroscopy through the Fourier transform for the analysis of biological material. In [25], the Short-Time Fourier Transform (STFT) is used to extract features from the audio signal by slicing the signal into windows and performing a Fourier transform on each window to obtain meaningful information. Deep Learning (DL) [26] models extract intricate structures in data, and the back-propagation algorithm indicates which parameters are used to compute the representation in each layer. DL allows computational models with multiple processing layers to learn data representations with many levels of abstraction. In [27], the authors elaborate on feature extraction in speech, dividing speech recognition into three stages. First, the audio signal is divided into small chunks; second, the phonemes are extracted and processed; and lastly, the speech is categorized at the word level. Music detection is discussed in [28], where the authors use a CNN with a mel kernel to separate music content from speech and noise. The mel-scale is useful for focusing on a specific range of frequencies and minimizing the effect of noisy and unrelated parts.
In [29], an attention model is used for audio tagging on Google Audio Set [30]. The authors investigate the Multi-Instance Learning (MIL) problem for weakly labeled Audio Set classification by introducing an attention model for probabilistic learning, where attention is used with a fully connected neural network for multi-label audio classification. Multi-head attention is used in [31], where the authors explain how multiple heads jointly attend to different interpretations within the data, extracting information from multiple representation subspaces at various positions. Multi-head attention is useful for obtaining different contexts within the information, which improves the efficiency of the model.
In this paper, we present a novel end-to-end binary-view transformer-based architecture for speech recognition to cope with the frequency resolution problem. Firstly, the input audio signal is transformed into a 2D image using a Mel-spectrogram. Secondly, the multi-view mechanism is used to enhance the frequency resolution in the image. In addition, the modified universal transformer uses multi-head attention to derive contextual information and different speech-related features. A feed-forward neural network is also deployed for classification. The proposed system is discussed in detail in Section 3. Moreover, the proposed system achieves robust results on Google's speech command dataset, with an accuracy of 95.16% and minimal loss. The binary-view transformer reduces the risk of over-fitting by deploying a multi-view mechanism to diversify the input data, and multi-head attention captures multiple contexts from the data's feature map.
The rest of the paper is organized as follows: Section 2 covers speech perception and recognition using AI, and the proposed system is discussed in Section 3. Section 4 describes the experimental steps and testing. Section 5 presents the experimental results and discussion. Finally, Section 6 concludes the research work.
Perception is the ability to systematically receive information, identify essential data features, and then interpret that information, while recognition is the system's ability to identify the classification of data. To build an AI system for speech recognition, we need input data in the form of an audio signal. After pre-processing, the audio signal proceeds to the speech recognition system, and the system's output is a classification transcript of the audio. A microphone records the audio signal with a bit depth of 16, i.e., each sample in the time domain takes one of 2^16 values. Audio is recorded at 16 kilohertz, giving a Nyquist frequency of 8 kilohertz; the band below the Nyquist frequency covers the lower frequencies that the brain can interpret and differentiate for speech, because most frequency changes happen at lower frequencies. The signal in the time domain is difficult to interpret, as the human ear senses the intensity of frequencies. We therefore use a pre-processing step that applies the Fourier transform, so that the time-domain representation of the signal is transformed into a time-frequency domain.
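As a quick check of the recording parameters described above, the following Python sketch computes the number of amplitude levels for a given bit depth and the Nyquist frequency for a given sampling rate. The function names are illustrative and are not taken from the paper.

```python
def quantization_levels(bit_depth: int) -> int:
    """Number of distinct amplitude values one PCM sample can take."""
    return 2 ** bit_depth


def nyquist_frequency(sample_rate_hz: float) -> float:
    """Highest frequency that can be represented without aliasing."""
    return sample_rate_hz / 2.0


# Values from the text: 16-bit samples recorded at 16 kHz.
levels = quantization_levels(16)
nyquist = nyquist_frequency(16_000)
print(levels, nyquist)
```

With a bit depth of 16 and a sampling rate of 16 kHz, this gives 65,536 levels and a Nyquist frequency of 8 kHz.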
The power spectral density of the audio signal for different frequency bands is shown in Fig. 1, where most frequency changes occur within the Nyquist frequency range. We create a spectrogram by stacking periodograms adjacent to one another over time. The spectrogram is a colored 2D image representation of the audio.
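The stacking of periodograms can be sketched as follows. This is a minimal, standard-library-only illustration, not the pre-processing code used in this work: it relies on a naive discrete Fourier transform and a Hann window, and the frame length, hop size, test tone, and function names are assumptions chosen for readability.

```python
import cmath
import math


def periodogram(frame):
    """Power at each non-negative frequency bin of one frame (naive DFT)."""
    n = len(frame)
    powers = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        powers.append(abs(coeff) ** 2 / n)
    return powers


def spectrogram(signal, frame_len, hop):
    """Stack periodograms of successive Hann-windowed frames, one row per frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / frame_len)
              for t in range(frame_len)]
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + t] * window[t] for t in range(frame_len)]
        rows.append(periodogram(frame))
    return rows


# Toy input (assumption): a 2 kHz tone sampled at 16 kHz, 128 samples long.
sample_rate = 16_000
tone_hz = 2_000
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(128)]

spec = spectrogram(signal, frame_len=32, hop=16)
# Index of the strongest frequency bin in every frame.
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spec]
print(len(spec), len(spec[0]), peak_bins)
```

Each row of `spec` is one periodogram; plotting the rows side by side over time yields the 2D image. With a 32-sample frame at 16 kHz, each bin spans 500 Hz, so the 2 kHz tone falls in bin 4 of every frame.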
Figure 1: Periodogram of audio frequencies and the 2D representation of audio signal using spectrogram
For speech recognition, the human brain amplifies some frequencies while nullifying or reducing background noise, giving more importance to the lower band of frequencies. For example, humans can tell the difference between 40 and 100 hertz, but are unable to differentiate between 10,000 and 12,000 hertz. In computing, this is achieved through the mel-scale; by applying a mel filter-bank to the frequencies, we can retrieve the lower frequencies efficiently, as shown in Fig. 2.
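The effect of the mel-scale on the two frequency pairs mentioned above can be illustrated with a short sketch. The paper does not state which mel formula is used; the code assumes the commonly used conversion mel = 2595 log10(1 + f/700), and all function names are hypothetical.

```python
import math


def hz_to_mel(freq_hz):
    """Convert hertz to mel (assumed formula: 2595 * log10(1 + f / 700))."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_per_hz(f1, f2):
    """Average number of mels covered by one hertz between f1 and f2."""
    return (hz_to_mel(f2) - hz_to_mel(f1)) / (f2 - f1)


def mel_center_frequencies(low_hz, high_hz, num_filters):
    """Filter centers spaced evenly on the mel axis, returned in hertz."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(num_filters)]


low_band = mel_per_hz(40.0, 100.0)
high_band = mel_per_hz(10_000.0, 12_000.0)
print(f"mel per Hz, 40-100 Hz:        {low_band:.3f}")
print(f"mel per Hz, 10,000-12,000 Hz: {high_band:.3f}")

# Ten filter centers between 0 Hz and the 8 kHz Nyquist frequency.
centers = mel_center_frequencies(0.0, 8000.0, 10)
print([round(c) for c in centers])
```

Under this formula, one hertz in the low band covers more than ten times as many mels as one hertz in the high band, and the filter centers are packed more densely at low frequencies, which is the behavior described in the text.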
Figure 2: Mel filter-banks and the frequencies from linear to logarithmic
In the field of machine learning, the CNN is one of the most efficient algorithms for image recognition. Since the inception of the CNN, the field of machine learning has been revolutionized, and state-of-the-art results have been produced. In a CNN, different filters are convolved over the image to compute essential features using Eqs. (1) and (2), where filter B convolves over image A, with k denoting the number of rows and columns. Convolution gives a large pool of features from the data, which is passed to a neural network (NN) that classifies them into different classes. Many variants of the CNN have been produced over the years that improve the performance of these models, e.g., Inception net [32], ResNet [33], and MobileNet [34].
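The sliding-filter operation can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a transcription of Eqs. (1) and (2): it treats k as the side length of a square filter B, uses no padding and a stride of one, and computes the cross-correlation form (no filter flip) that CNN libraries commonly call convolution.

```python
def convolve2d_valid(image, kernel):
    """Slide a k x k kernel over a 2D image (no padding, stride 1)."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


# Toy 3 x 3 image A and 2 x 2 filter B (both assumptions for illustration).
image_a = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]
filter_b = [[1, 0],
            [0, -1]]

feature_map = convolve2d_valid(image_a, filter_b)
print(feature_map)  # [[-4, -4], [-4, -4]]
```

Each output value is the sum of element-wise products between the filter and the image patch under it; here the filter subtracts the lower-right neighbor from each pixel, so every patch of this evenly increasing image yields -4.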
RNN algorithms allow the machine to process temporal sequences of variable length [14]. This type of algorithm is useful for processing sequential data through sequential modelling, e.g., signal processing, NLP, and speech recognition. RNN models produce a hidden vector as a function of the former state and the next input, as shown in Eq. (3), where the input vectors A are processed sequentially by a recurrence function with parameters w at each time-stamp to produce a new state for the model.
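A minimal sketch of such a recurrence is given below. It assumes the common vanilla-RNN form, in which the new state is the tanh of a weighted sum of the previous state and the current input; the weight names and the choice of tanh are assumptions for illustration and may differ from Eq. (3).

```python
import math


def rnn_step(h_prev, a_t, w_hh, w_ah, bias):
    """One recurrence step: new state from the previous state and current input."""
    return [
        math.tanh(
            sum(w_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
            + sum(w_ah[i][j] * a_t[j] for j in range(len(a_t)))
            + bias[i]
        )
        for i in range(len(bias))
    ]


def run_rnn(inputs, h0, w_hh, w_ah, bias):
    """Process the input vectors one time-stamp at a time, reusing the same weights."""
    h = h0
    states = []
    for a_t in inputs:
        h = rnn_step(h, a_t, w_hh, w_ah, bias)
        states.append(h)
    return states


# Toy example with a one-dimensional state and input (hypothetical values).
states = run_rnn(inputs=[[0.5], [1.0]], h0=[0.0],
                 w_hh=[[0.0]], w_ah=[[1.0]], bias=[0.0])
print(states)
```

The loop in `run_rnn` makes the sequential dependency explicit: step t cannot start until step t-1 has produced its state.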
Recurrence models generate a sequential pattern of data that prevents parallelization during training. The sequential nature increases the computation time of the model and limits the processing of longer sequences, where the gradient vanishing/exploding problem arises.
Attention is a deep learning mechanism mainly inspired by natural human perception: humans receive information in raw form from the senses and transmit it to the brain [29]. The brain selects the relevant and useful information and ignores background noise; this process polishes the data, making it easier to perceive. In a neural network, attention is a weighted probabilistic vector computed with the soft-max function, which was introduced to improve sequential models (LSTMs, RNNs) by capturing essential features in context vectors, as shown in Eq. (4); Attention_weight is elaborated in Eq. (5).
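The idea of a soft-max-weighted context vector can be sketched as follows. The code assumes that a relevance score is already available for each hidden state; how the scores are produced, and the function names, are assumptions and are not taken from Eqs. (4) and (5).

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(hidden_states, scores):
    """Weighted sum of hidden states, using soft-max attention weights."""
    weights = softmax(scores)
    dim = len(hidden_states[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden_states))
            for d in range(dim)]


weights = softmax([1.0, 2.0, 3.0])
print(weights)

# Equal scores give equal weights, so the context is the mean of the states.
context = context_vector([[1.0, 0.0], [0.0, 1.0]], scores=[0.0, 0.0])
print(context)
```

Higher scores receive larger weights, so the context vector is dominated by the states judged most relevant.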
The attention mechanism extracts model dependencies while negating the effect of distance between input and output sequences, which improves model performance. Self-attention [35] is a variation of the attention mechanism that allows the vectors to interact with each other to discover the important features, so that more attention can be given to them. Applying attention to sequential models improves accuracy, and state-of-the-art results are achieved.
In the transformer architecture, instead of a sequential representation, the positional information of the input data is embedded in a positional vector alongside the input vectors, which permits parallelization. The transformer architecture consists of two parts, i.e., encoder layers and decoder layers. In a transformer, the attention mechanism is used for content-based memory retrieval, where the decoder attends to the encoded content and decides which information needs to be extracted based on affinity or position. The transformer processes the input data in the form of pixel blocks [36], i.e., each row of the image is embedded, and the positional information of the data is encoded into the input embedding using a positional encoder, after which the result is passed to the transformer for processing.
The positional information is extracted from the data using the positional encoder E, which is added to the input embedding. The input embedding and the positional encoder have the same dimension d, so that the two can be summed. The positional encoding is computed using the sine function in Eq. (6) and the cosine function in Eq. (7), alternately. Eq. (6) is used for odd values and Eq. (7) for even values, as shown in Eq. (8) for an input sequence of length n. The sine and cosine functions are used to create a unique pattern of values for each position.
where p is the position to encode, and f_i are the frequencies for i up to d/2, as shown in Eq. (9).
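A sinusoidal positional encoder of this kind can be sketched as follows. The frequency schedule f_i = 1 / 10000^(2i/d) is an assumption borrowed from the standard transformer formulation and may differ from Eq. (9); d is assumed to be even. Counting dimensions from one, the sine terms land on the odd dimensions and the cosine terms on the even ones.

```python
import math


def positional_encoding(n, d, base=10000.0):
    """Return an n x d table; row p encodes position p with sine/cosine pairs."""
    table = []
    for p in range(n):
        row = []
        for i in range(d // 2):
            f_i = 1.0 / (base ** (2 * i / d))  # assumed frequency schedule
            row.append(math.sin(p * f_i))      # 1st, 3rd, 5th, ... dimension
            row.append(math.cos(p * f_i))      # 2nd, 4th, 6th, ... dimension
        table.append(row)
    return table


def add_position(embeddings, encoding):
    """Element-wise sum of input embeddings and positional encodings."""
    return [[x + e for x, e in zip(x_row, e_row)]
            for x_row, e_row in zip(embeddings, encoding)]


pe = positional_encoding(n=4, d=6)
for row in pe:
    print([round(v, 3) for v in row])
```

Because the encoding has the same dimension d as the input embedding, `add_position` can sum the two directly, and every position receives a distinct pattern of values.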
In the transformer encoder layer, the embedded input X = {x1, x2, x3, ..., xn} is fed into three fully connected layers to create three embeddings, namely keys, queries, and values; these embeddings are commonly used in search retrieval systems. During search retrieval, the query is mapped against keys that are associated with search candidates, which yields the best-matching results (values). The attention value of input embedding x1 against x2 is computed as shown in Fig. 3 (transformer self-attention). Firstly, Q, K, and V are randomly initialized with the same dimension as the input embedding. The input x1 is matrix-multiplied with Q to produce the query embedding Qe, and the embedding x2 is matrix-multiplied with K to produce the key embedding Ke; the dot product of the resulting matrices (the weighted score matrix Z) is then calculated. The scores Z are then scaled down, as shown in Eq. (10), for a stable gradient, where dKe is the dimension of the key embedding.
The softmax function in Eq. (11) is applied to Zs = {z1, z2, z3, ..., zn} to calculate the attention weights, giving probability values between zero and one. As shown in Fig. 3, the input embedding is a matrix that is multiplied with V to produce the value embedding Ve.
Figure 3: Transformer: Encoding layer self-attention and the transformer layer and model
The attention weights obtained from Zs are then multiplied with Ve. This process is repeated for all neighboring inputs, and the resulting weighted Ve are then concatenated. The functionality is elaborated in Eq. (12), where Ke^T is the transpose of the key embedding. Self-attention thus produces a weighted embedding for each input.
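The steps described above (projecting the input to Qe, Ke, and Ve, scaling the scores by the square root of the key dimension, applying soft-max, and weighting Ve) can be sketched with plain Python lists. This is a single-head illustration with hypothetical toy weights, not the multi-head implementation used in the proposed model.

```python
import math


def matmul(a, b):
    """Matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Apply soft-max to every row so that each row sums to one."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the rows of x."""
    q_e, k_e, v_e = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k_e[0])                      # dimension of the key embedding
    scores = matmul(q_e, transpose(k_e))   # score matrix Z
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = softmax_rows(scaled)         # attention weights
    return matmul(weights, v_e), weights


# Toy inputs (assumptions): two 2-dimensional embeddings, identity projections.
x = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = self_attention(x, identity, identity, identity)
print(weights)
print(output)
```

In this toy case each input attends most strongly to itself, because its query matches its own key best; in a trained model the three projection matrices are learned rather than fixed.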
The architecture proposed in this paper is a novel binary-view transformer architecture, an end-to-end model for a speech recognition system, which is inspired by the human perception of speech and designed by carefully studying human physiology. The architecture consists of two primary modules, i.e., (i) a pre-processing and feature extraction module and (ii) a classification module.
Three convolution layers are applied for feature extraction on both inputs. The filter size is 3×3, and the numbers of filters are 32, 64, and 1, respectively. After each convolution layer, batch normalization is applied with an activation function. Both inputs are then concatenated to combine the extracted features in the multi-view, i.e., binary-view, model.
Our system incorporates a modified universal transformer [19,22], where multi-head self-attention is used with four heads capturing four different contexts at the same time. The depth of the transformer is six, i.e., six encoding and six decoding transformer layers are implemented. The transformer uses 25 percent dropout after each layer, and the high-performance Gaussian error linear unit [37] is used as the activation function of the neural network. The adaptive computation time algorithm is then used to allow the neural network to determine the number of computation steps between receiving inputs and computing outputs. The resultant vectors then proceed to global average pooling [38], where the mean value of every feature map is computed, and soft-max then determines its probabilities. Lastly, the feature map is passed to a dense, i.e., fully connected, layer of 128 nodes and subsequently to another dense layer with a number of nodes equal to the number of desired classes, where the input is classified into its respective class. It is important to note that classification is vital considering that internet data traffic is increasing with every passing day [39-42]. The working of the system is shown in Algorithm 1.
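Two of the building blocks named above, the Gaussian error linear unit and global average pooling, are simple enough to sketch directly. The code uses the exact erf-based definition of GELU and assumes feature maps stored as a list of 2D grids; the layout and function names are illustrative assumptions, not the implementation used in this work.

```python
import math


def gelu(x):
    """Gaussian error linear unit: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def global_average_pool(feature_maps):
    """Reduce every 2D feature map to the mean of its values."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


print([round(gelu(v), 4) for v in (-2.0, -1.0, 0.0, 1.0, 2.0)])

# Two toy 2 x 2 feature maps (hypothetical values).
maps = [[[1.0, 3.0], [5.0, 7.0]],
        [[0.0, 0.0], [0.0, 2.0]]]
pooled = global_average_pool(maps)
print(pooled)  # [4.0, 0.5]
```

Global average pooling yields one number per feature map regardless of the map's size, which keeps the input to the final dense layers small.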
For training the proposed model, we use Google's speech command dataset [43,44], created by Google Brain, which contains speech audio files in WAV format, with a total of 105,829 command utterance files. The audio files have a length of 1 s and are divided into 35 classes of words. The audio was recorded in a 16-bit mono channel, and the command files were collected from 2618 different speakers with a range of dialects.
The architecture was trained on Google's cloud service for machine learning, Google Colab, which uses a Jupyter notebook environment and provides a Tesla K80 Graphics Processing Unit (GPU) with 12 GB of GPU memory.
We trained different architectures for speech recognition over 35 classes, including our binary-view transformer model and the models introduced in [3], i.e., LSTM and attention-based recurrent convolutional architectures. We also experimented with the well-known convolutional architectures ResNet and Inception net, where we modified our model by replacing the transformer with ResNet (Fig. 4) and Inception net (Fig. 6); the proposed architecture is shown in Fig. 5. We then computed and compared their results.
Figure 4: Binary-view ResNet
Figure 5: System architecture of the binary-view convolutional transformer, which captures multiple feature contexts of each class, making recognition more robust
Figure 6: Binary-view inception net
Initially, the experiments of [3] were replicated, including the LSTM model and attention with a recurrent convolutional network. The purpose of the experiments is to compare and demonstrate the efficiency and shortcomings of different neural network models for speech recognition. We improved the validation accuracy by gradually decreasing the learning rate over epochs, increasing the number of filters in the convolutions, and introducing batch normalization. The LSTM model [3] reached an accuracy of up to 93.9%, and the attention-based recurrent convolutional network [3] reached 94.2%. The transformer architecture without multiple inputs gives 94.24% accuracy. The binary-view ResNet model, binary-view Inception net model, binary-view convolutional model, and binary-view transformer model were executed on the dataset, with validation accuracies of 94.91%, 94.74%, 95.05%, and 95.16%, respectively, as shown in Fig. 7. Moreover, the proposed transformer model produced state-of-the-art results with a minimal number of parameters, i.e., 375,787. Fig. 8 shows the comparison of training and validation accuracies.
In terms of loss, the binary-view transformer model has the lowest validation loss, 0.191. The single-input transformer model produces a loss of 0.227. The losses of the binary-view ResNet model, binary-view Inception net model, binary-view convolutional model, and attention-based recurrent convolutional network were 0.194, 0.192, 0.21, and 0.237, respectively, as can be seen in Fig. 9. The lower loss indicates better performance of the architecture and a lower chance of over-fitting, in line with the aim of mitigating the gradient vanishing/exploding problem.
Table 1: Results comparison of the proposed approach with existing studies
Figure 7: Validation accuracies of speech recognition models
Figure 8: Training and validation accuracies of implemented models
Figure 9: Validation loss of speech recognition model
This research aimed to improve the speech recognition system. We analyzed human physiology for speech perception. The binary-view transformer architecture produced state-of-the-art results on Google's speech command dataset [43,44]. Three aspects of recognition models, i.e., validation accuracy, precision, and loss, were considered to determine the efficiency of the binary-view transformer architecture. By introducing a binary-view mechanism, similar data from different sources were processed, and the attention mechanism within the transformer increased efficiency, achieving the best validation accuracy of 95.16%. The proposed model reduced the likelihood of the gradient vanishing/exploding problem by processing long-term dependencies. The confusion matrix showed better precision for the binary-view transformer architecture than for the other models, since the transformer used a multi-head attention mechanism that captures more contexts of the same data, which helped to improve model precision and diminish the probability of over-fitting. Better precision on Google's speech command dataset showed that our model performed better on different dialects, because the speech of over 2000 speakers was precisely recognized. As shown in Tab. 1, our model exhibited a lower loss of 0.191, compared to 0.237, 0.194, 0.192, and 0.21 for the attention-based recurrent convolutional network [3], the binary-view ResNet model, the binary-view Inception net model, and the binary-view convolutional model, respectively. The binary-view transformer architecture has a lightweight footprint of 375,787 trainable parameters and can be run locally on small systems.
Funding Statement: This research was supported by Suranaree University of Technology, Thailand, Grant Number: BRO7-709-62-12-03.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
Computers, Materials & Continua, 2022, Issue 9