Authors: Mohit Wadhwa, Anurag Gupta, Prateek Kumar Pandey
Acknowledgements: Paulami Das, Head of Data Science CoE, and Anish Roychowdhury, Senior Analytics Leader, Brillio
Organizations: Brillio Technologies, Indian Institute of Technology, Kharagpur
As human beings, speech is among the most natural ways we express ourselves. We depend on it so much that we recognize its importance only when resorting to other forms of communication, such as emails and text messages, where we often use emojis to express the emotions associated with the messages. As emotions play a vital role in communication, detecting and analysing them is of vital importance in today’s digital world of remote communication. Emotion detection is a challenging task because emotions are subjective and there is no consensus on how to measure or categorize them. We define a speech emotion recognition (SER) system as a collection of methodologies that process and classify speech signals to detect the emotions embedded in them. Such a system can find use in a wide variety of application areas, such as interactive voice-based assistants or caller-agent conversation analysis. In this study we attempt to detect the underlying emotions in recorded speech by analysing the acoustic features of the audio recordings.
2. Solution Overview
There are three classes of features in speech, namely the lexical features (the vocabulary used), the visual features (the expressions the speaker makes) and the acoustic features (sound properties like pitch, tone, jitter, etc.).
The problem of speech emotion recognition can be solved by analysing one or more of these features. Following the lexical features would require a transcript of the speech, which in turn requires an additional step of extracting text from speech if one wants to predict emotions from real-time audio. Similarly, analysing visual features would require access to video of the conversations, which might not be feasible in every case. The acoustic features, on the other hand, can be analysed in real time while the conversation is taking place, as only the audio data is needed. Hence, we chose to analyse the acoustic features in this work.
Furthermore, the representation of emotions can be done in two ways:
Discrete Classification: Classifying emotions in discrete labels like anger, happiness, boredom, etc.
Dimensional Representation: Representing emotions with dimensions such as Valence (on a negative to positive scale), Activation or Energy (on a low to high scale) and Dominance (on an active to passive scale)
Both these approaches have their pros and cons. The dimensional approach is more elaborate and gives more context to prediction but it is harder to implement and there is a lack of annotated audio data in a dimensional format. The discrete classification is more straightforward and easier to implement but it lacks the context of the prediction that dimensional representation provides. We have used the discrete classification approach in the current study for lack of dimensionally annotated data in the public domain.
3. Data Sources
The data used in this project was combined from five different data sources as mentioned below:
TESS (Toronto Emotional Speech Set): 2 female speakers (young and old), 2800 audio files, random words were spoken in 7 different emotions.
SAVEE (Surrey Audio-Visual Expressed Emotion): 4 male speakers, 480 audio files, same sentences were spoken in 7 different emotions.
RAVDESS: 2452 audio files, 12 male and 12 female speakers; the lexical features (vocabulary) of the utterances are kept constant, with all speakers speaking only 2 statements of equal length in 8 different emotions.
CREMA-D (Crowd-Sourced Emotional Multimodal Actors Dataset): 7442 audio files, 91 different speakers (48 male and 43 female between the ages of 20 and 74) of different races and ethnicities, different statements are spoken in 6 different emotions and 4 emotional levels (low, mid, high and unspecified).
Berlin: 5 male and 5 female speakers, 535 audio files, 10 different sentences were spoken in 6 different emotions.
4. Features used in this study
From the audio data we extracted three key features which have been used in this study, namely MFCC (Mel Frequency Cepstral Coefficients), Mel Spectrogram and Chroma. The Python package Librosa was used for their extraction.
Choice of features
MFCC was by far the most researched and most widely used feature in the research papers and open source projects we reviewed.
A Mel spectrogram plots amplitude on a frequency vs time graph, with frequency on the “Mel” scale. As the project is on emotion recognition, which is purely subjective, we found it better to plot the amplitude on the Mel scale, since it converts the recorded frequency into “perceived frequency”.
Researchers have also used Chroma in their projects according to the literature, so we tried basic modelling both with MFCC and Mel only and with MFCC, Mel and Chroma together. The model with all three features gave slightly better results, so we chose to keep all of them.
Details about the features are mentioned below.
MFCC (Mel Frequency Cepstral Coefficients)
In the conventional analysis of time signals, any periodic component (for example, echoes) shows up as sharp peaks in the corresponding frequency spectrum (i.e. the Fourier spectrum, obtained by applying a Fourier transform to the time signal). A cepstral feature is obtained by applying a further Fourier-type transform to the logarithm of a spectrum. The special characteristic of MFCC is that it is computed on the Mel scale, which relates the perceived frequency of a tone to its actual measured frequency. It scales the frequency to match more closely what the human ear can hear. The envelope of the short-time power spectrum of the speech signal is representative of the vocal tract, and MFCC represents this envelope compactly.
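To make the Mel scale concrete, the sketch below converts between Hz and mels using the widely used HTK-style formula. This is an illustrative assumption: the study does not state which Mel variant was used, and Librosa's default is a slightly different (Slaney-style) mapping.

```python
import math

def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Equal steps in Hz shrink on the Mel scale as frequency rises,
# mirroring how the ear resolves low frequencies more finely.
for freq in (100, 1000, 4000, 8000):
    print(freq, "Hz ->", round(hz_to_mel(freq), 1), "mel")
```

With this formula 1000 Hz maps to roughly 1000 mel, while doubling the frequency from 4000 Hz to 8000 Hz adds far fewer mels than 4000.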
Mel Spectrogram
A Fast Fourier Transform is computed on overlapping windowed segments of the signal, which gives what is called the spectrogram. The Mel spectrogram is simply a spectrogram whose frequency axis is mapped onto the Mel scale.
Chroma
A Chroma vector is typically a 12-element feature vector indicating how much energy of each pitch class of the standard chromatic scale is present in the signal.
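The idea of folding energy into 12 pitch classes can be sketched as follows. The helper names, the 440 Hz reference for A4 and the toy list of spectral peaks are assumptions for illustration; a library such as Librosa computes chroma from the full spectrogram rather than from a handful of peaks.

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(freq_hz: float, a4_hz: float = 440.0) -> int:
    """Index (0 = C ... 11 = B) of the pitch class nearest to freq_hz."""
    semitones_from_a4 = round(12 * math.log2(freq_hz / a4_hz))
    return (semitones_from_a4 + 9) % 12  # A sits at index 9 when C is 0

def chroma_vector(peaks):
    """Fold (frequency_hz, energy) pairs into a normalised 12-bin vector."""
    chroma = [0.0] * 12
    for freq_hz, energy in peaks:
        chroma[pitch_class(freq_hz)] += energy
    total = sum(chroma)
    return [value / total for value in chroma] if total else chroma

# Toy input: A4, A5 (one octave up) and middle C.
vector = chroma_vector([(440.0, 1.0), (880.0, 1.0), (261.63, 2.0)])
```

Notes an octave apart land in the same bin, which is why the vector always has 12 elements regardless of the frequency range of the signal.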
5. Pre Processing
As the extracted features were typically 2D in form, we decided to take a two-pronged approach, using both a 1D form and a 2D form of the input, as discussed below.
1D Data Format
The features extracted from the audio clips are in matrix form. To model them with traditional ML algorithms like SVM and XGBoost, or with a 1-D CNN, we converted the matrices into 1-D arrays by taking either row means or column means. In preliminary modelling the array of row means gave better results than the array of column means, so we proceeded with the 1-D array of row means of the feature matrices.
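The difference between the two reductions is easy to see on a toy matrix. The sketch below uses plain Python lists to stay dependency-free; the matrix values are made up, and with NumPy the same reductions would be `matrix.mean(axis=1)` and `matrix.mean(axis=0)`.

```python
from statistics import fmean

def row_means(matrix):
    """One value per feature row (e.g. per MFCC coefficient), averaged over time."""
    return [fmean(row) for row in matrix]

def column_means(matrix):
    """One value per time frame, averaged over all feature rows."""
    return [fmean(column) for column in zip(*matrix)]

# Toy feature matrix: 2 coefficients (rows) x 3 time frames (columns).
toy_features = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]
per_coefficient = row_means(toy_features)   # length fixed by the feature count
per_frame = column_means(toy_features)      # length varies with clip duration
```

Row means give a vector whose length depends only on the number of coefficients or bands, so clips of different durations produce vectors of the same size.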
2D Data Format
The 2D features were used in the deep learning model (CNN-2D). The number of rows of each feature matrix depends on the n_mfcc or n_mels parameter chosen during extraction, while the number of columns depends on the audio duration and the sampling rate. Since the audio clips in our datasets varied in length from just under 2 seconds to over 6 seconds, choosing one median length, clipping longer files to it and zero-padding shorter ones was not feasible: longer clips would lose information and shorter clips would be silent for the latter part of their length. To address this, we used different sampling rates during extraction depending on the audio length. Any audio file of 5 seconds or longer was clipped at 5 seconds and sampled at 16000 Hz, and shorter clips were sampled such that the product of audio duration and sampling rate remained 80000. In this way, we were able to keep the matrix dimensions the same for all audio clips without losing much information.
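The sampling-rate rule described above can be sketched as a small helper. The function and constant names are hypothetical; only the 5-second cut-off, the 16000 Hz rate and the 80000-sample target come from the text.

```python
TARGET_SAMPLES = 80_000   # duration * sampling rate is held at this value
MAX_DURATION_S = 5.0      # longer clips are cut at this length
BASE_RATE_HZ = 16_000     # rate used for clips of 5 seconds or more

def loading_params(duration_s: float):
    """Return (duration to load in seconds, sampling rate in Hz) for one clip."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    if duration_s >= MAX_DURATION_S:
        return MAX_DURATION_S, BASE_RATE_HZ
    # Shorter clips get a higher rate so that they still yield ~80000 samples.
    return duration_s, round(TARGET_SAMPLES / duration_s)

for clip_length in (1.9, 3.3, 5.0, 6.2):
    print(clip_length, loading_params(clip_length))
```

The returned values could be passed to a loader such as `librosa.load(path, sr=..., duration=...)`. Because the rate is rounded to a whole number, the sample count can differ from 80000 by a few samples, so a final pad or trim to the exact length may still be needed.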
6. Exploratory Data Analysis
The combined data set from the original 5 sources was analysed with respect to the following aspects:
Emotion distribution by gender
Variation in energy across emotions
Variation of relative pace and power across emotions
We checked the distribution of labels with respect to emotions and gender and found that while the data is balanced for six emotions, viz. neutral, happy, sad, angry, fear and disgust, the number of labels was slightly lower for surprise and negligible for boredom. While the slightly fewer instances of surprise can be overlooked on account of it being a rarer emotion, the imbalance against boredom was rectified later by merging sadness and boredom, as they are acoustically similar. It’s also worth noting that boredom could have been combined with the neutral emotion, but since both sadness and boredom are negative emotions, it made more sense to combine them.
Emotion Distribution by Gender
Regarding the distribution of gender, the number of female speakers was found to be slightly higher than the number of male speakers, but the imbalance was not large enough to warrant any special attention (see Fig. 1).
Fig.1 Distributions of emotion with respect to gender
Variation in Energy Across Emotions
As the audio clips in our dataset were of different lengths, power, which is energy per unit time, was found to be a more uniform measure than energy for studying energy variation. This metric was plotted with respect to the different emotions. From the graph (see Fig. 2) it is quite evident that people primarily express anger or fear through a higher energy delivery. We also observe that disgust and sadness are closer to neutral with regard to energy, although exceptions do exist.
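A minimal sketch of the power measure follows, assuming energy is taken as the sum of squared sample amplitudes; the study does not spell out its exact definition, so the function name and the toy signals are illustrative only.

```python
def average_power(samples, sample_rate_hz):
    """Signal energy (sum of squared amplitudes) divided by duration in seconds."""
    if not samples:
        raise ValueError("empty signal")
    energy = sum(value * value for value in samples)
    duration_s = len(samples) / sample_rate_hz
    return energy / duration_s

short_clip = [1.0, -1.0, 1.0, -1.0]   # 2 seconds at a toy rate of 2 Hz
long_clip = short_clip * 2            # same loudness, twice as long

short_power = average_power(short_clip, 2)
long_power = average_power(long_clip, 2)
```

The longer clip has twice the energy but the same power, which is why power is the fairer basis for comparing clips of different lengths.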
Figure 2 Distribution of power across emotions
Variation of Relative Pace and Power with respect to Emotions
A scatter plot of power vs relative pace of the audio clips was analysed, and it was observed that the ‘disgust’ emotion was skewed towards the low pace side while the ‘surprise’ emotion was skewed towards the higher pace side. As mentioned before, anger and fear occupy the high power space and sadness and neutral occupy the low power space, while being scattered pace-wise. Only the RAVDESS dataset was used for this plot because it contains only two sentences of equal length spoken in different emotions, so the lexical features don’t vary and the relative pace can be reliably calculated.
Figure 3 Scatter of power Vs relative pace of audio clips
The solution pipeline for this study is depicted in the schematic shown in Fig. 4. The raw signal is the input, which is processed as shown. First, the 2D features were extracted from the datasets and converted into 1-D form by taking the row means. A measure of noise was added to the raw audio of 4 of our datasets (all except CREMA-D, as these were studio recordings and thus cleaner). Features were then extracted from the noisy files and added to our dataset as augmentation. After feature extraction we applied various ML algorithms, namely SVM, XGB, CNN-1D (Shallow) and CNN-1D (Deep), on our 1D data frame and CNN-2D on our 2D tensor. As some of the models were overfitting the data, and considering the large number of features (181 in 1D), we tried dimensionality reduction to curb overfitting and trained the models again.
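The noise augmentation step might look like the sketch below. The study does not say what kind or amount of noise was used, so white Gaussian noise and the `noise_level` value here are assumptions chosen purely for illustration.

```python
import random

def add_white_noise(samples, noise_level=0.005, seed=None):
    """Return a copy of the signal with Gaussian noise scaled to its peak amplitude.

    noise_level is a hypothetical setting, not a value taken from the study.
    """
    rng = random.Random(seed)
    peak = max((abs(value) for value in samples), default=0.0)
    return [value + rng.gauss(0.0, noise_level * peak) for value in samples]

clean = [0.0, 0.5, -0.5, 1.0, -1.0]
noisy = add_white_noise(clean, seed=42)
# Both the clean and the noisy version would go through feature extraction,
# so each augmented clip contributes one extra training example.
```

Scaling the noise to the clip's own peak keeps the perturbation proportionate for both quiet and loud recordings.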
Figure 4: Schematic of solution pipeline
Selection of Train Test Data
We chose a customized split logic for the various ML models used.
For the SVM and XGB models the data was simply split into train and test sets in the ratio 80:20 and validated using 5-fold cross-validation. For the CNNs, both 1D and 2D, two consecutive splits were used: the data was first split in a 90:10 ratio, with the 10% kept as the test set, and the remaining 90% was then split in an 80:20 ratio into train and validation sets.
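The two consecutive splits used for the CNNs work out to 72% train, 18% validation and 10% test. The sketch below shows this with a plain shuffled index split; the function name and seed are hypothetical, and in practice a helper such as scikit-learn's `train_test_split` (possibly stratified by label) would usually be applied twice instead.

```python
import random

def split_indices(n_items, seed=0):
    """Split item indices 90:10 into rest/test, then the rest 80:20 into train/val."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)

    n_test = round(0.10 * n_items)
    test, rest = indices[:n_test], indices[n_test:]

    n_val = round(0.20 * len(rest))
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```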
CNN Model Architectures
This model consisted of one convolution layer with 64 channels and ‘same’ padding, followed by a dense layer and the output layer.
This model was constructed in a similar format as VGG-16, but the last 2 blocks of 3 convolution layers were removed to reduce complexity.
This CNN model had the following architecture:
2 convolution layers of 64 channels, 3×3 kernel size and same padding followed by a max-pooling layer of size 2×2 and stride 2×2.
2 convolution layers of 128 channels, 3×3 kernel size and same padding followed by a max-pooling layer of size 2×2 and stride 2×2.
3 convolution layers of 256 channels, 3×3 kernel size and same padding followed by a max-pooling layer of size 2×2 and stride 2×2.
Each convolution layer had the ‘relu’ activation function.
After flattening, two dense layers of 512 units each were added, followed by dropout layers with rates of 0.1 and 0.2 respectively.
Finally, the output layer was added with a ‘softmax’ activation function.
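To get a feel for the size of the convolutional stack listed above, the sketch below traces feature-map shapes and counts convolution parameters block by block. The input size (128 x 157 with one channel) is a hypothetical example, not the study's actual input, and the pooling layers are assumed to use no padding so that odd dimensions round down.

```python
def vgg_style_summary(height, width, in_channels=1,
                      blocks=((2, 64), (2, 128), (3, 256))):
    """Trace output shapes and count conv parameters for the listed blocks."""
    shapes, conv_params = [], 0
    for n_convs, out_channels in blocks:
        for _ in range(n_convs):
            # 3x3 kernel weights plus one bias per output channel.
            conv_params += (3 * 3 * in_channels + 1) * out_channels
            in_channels = out_channels
        # 'same' padding keeps height and width; 2x2 pooling with stride 2 halves them.
        height, width = height // 2, width // 2
        shapes.append((height, width, out_channels))
    flattened_units = height * width * in_channels
    return shapes, conv_params, flattened_units

# Hypothetical input: 128 feature rows by 157 time frames, single channel.
shapes, conv_params, flattened_units = vgg_style_summary(128, 157)
```

For this example input the flattened vector feeding the first 512-unit dense layer is large, so most of the model's weights sit in that dense layer rather than in the convolutions, which is consistent with a model that overfits a modest dataset.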
Model Results Comparison
The results are compared on accuracy, which measures the agreement between the predicted values and the actual values. A confusion matrix is built from the true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), and accuracy is calculated from it as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
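As a small worked example, accuracy can be computed either from the four binary counts or, for a multi-class problem such as this one, as the sum of the diagonal of the confusion matrix divided by the total number of samples. The counts below are made up for illustration.

```python
def binary_accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def multiclass_accuracy(confusion):
    """Rows are actual classes, columns are predicted classes."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Toy 3-class confusion matrix (e.g. angry, sad, neutral).
toy_confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 3, 7],
]
toy_accuracy = multiclass_accuracy(toy_confusion)
```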
The models were trained on the training data and tested on the test data with 50, 100, 150 and 200 epochs. The accuracies were compared among all models, viz. SVM, XGBoost and Convolutional Neural Network (shallow and deep) for 1D features and 2D features. Fig. 5 shows the comparative performance across the different models.
Figure 5 : Comparative Results from different models
We find from Fig. 5 that although the CNN-1D Deep model gave the best accuracy on the test set, the CNN-1D Deep and CNN-2D models are clearly overfitting the dataset, with training accuracies of 99% and 98.38% respectively against much lower validation and test accuracies. CNN-1D Shallow, on the other hand, was more stable, with its train, validation and test accuracies closer to each other, though its test accuracy was a little lower than that of CNN-1D Deep.
Dimensionality Reduction Approach
To rectify the overfitting of the models we used dimensionality reduction. PCA was applied to the 1D features, reducing the dimensions from 180 to 120 with an explained variance of 98.3%. Dimensionality reduction made the models slightly less accurate and reduced the training time, but it did little to reduce overfitting in the deep learning models. From this we deduced that our dataset is simply not big enough for a complex model to perform well, and that the solution was limited by the lack of a larger data volume. Fig. 6 summarizes the results for the different models after dimensionality reduction.
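The link between the number of retained components and explained variance can be sketched as below. The function name and the eigenvalues are hypothetical; in practice the component variances come from fitting PCA on the feature matrix, and scikit-learn's `PCA` also accepts a fractional `n_components` to pick the count automatically.

```python
def components_for_variance(component_variances, target_ratio):
    """Smallest number of leading components whose variance share reaches target_ratio."""
    ordered = sorted(component_variances, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, variance in enumerate(ordered, start=1):
        cumulative += variance
        if cumulative / total >= target_ratio:
            return count
    return len(ordered)

# Toy spectrum: a few strong directions and a long tail of weak ones.
toy_variances = [5.0, 3.0, 1.0, 1.0]
needed_for_80 = components_for_variance(toy_variances, 0.80)
needed_for_95 = components_for_variance(toy_variances, 0.95)
```

When a long tail of components carries little variance, as the 180-to-120 reduction at 98.3% suggests, dropping it removes little information, which may explain why accuracy fell only slightly.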
Figure 6 : Comparative Results from different models after PCA based dimensionality reduction
Insights from Testing User Recordings
We tested the developed models on user recordings. From the test results we have the following observations:
An ensemble of CNN-2D and CNN-1D (shallow and deep) based on a soft voting gave best results on user recordings.
The model often got confused between anger and disgust.
The model also got confused among low energy emotions which are sadness, boredom and neutral.
If one or two words are spoken at a higher volume than the others, especially at the start or end of a sentence, the model almost always classifies the clip as fear or surprise.
The model seldom classifies an emotion as happy.
The model isn’t too noise sensitive, meaning it doesn’t falter as long as background noise is not too high.
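The soft-voting ensemble mentioned in the first observation above averages the class probabilities predicted by each model and picks the class with the highest average. The sketch below shows the idea with made-up probabilities for three models; the labels and numbers are illustrative, not outputs of the study's models.

```python
def soft_vote(probabilities_per_model, labels):
    """Average per-class probabilities across models and return the winning label."""
    n_models = len(probabilities_per_model)
    averaged = [sum(class_probs) / n_models
                for class_probs in zip(*probabilities_per_model)]
    best_index = max(range(len(labels)), key=averaged.__getitem__)
    return labels[best_index], averaged

labels = ["angry", "disgust", "neutral"]
model_outputs = [
    [0.5, 0.4, 0.1],   # e.g. CNN-2D
    [0.2, 0.6, 0.2],   # e.g. CNN-1D shallow
    [0.3, 0.5, 0.2],   # e.g. CNN-1D deep
]
winner, averaged = soft_vote(model_outputs, labels)
```

Unlike hard (majority) voting, soft voting lets a confident model outweigh two hesitant ones, which can help when models disagree on acoustically close emotions.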
Grouping Similar Emotions
Since the model was confusing similar emotions like anger-disgust and sad-bored, we tried combining those labels and training the model on 6 classes: neutral, sadness/boredom, happy, anger/disgust, surprise and fear. The accuracies certainly improved on reducing the number of classes, but this introduced another problem with regard to class imbalance. After combining anger-disgust and sad-boredom, the model developed a high bias towards anger-disgust. This may have happened because the number of instances of anger-disgust became disproportionately larger than that of the other labels. So, it was decided to stick with the older model.
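The imbalance introduced by merging labels is easy to reproduce on toy data. The label names in the mapping follow the text, while the counts below are invented for illustration and do not reflect the actual dataset.

```python
from collections import Counter

MERGED_LABEL = {
    "anger": "anger/disgust",
    "disgust": "anger/disgust",
    "sadness": "sadness/boredom",
    "boredom": "sadness/boredom",
}

def merge_labels(labels):
    """Map confusable emotions onto a shared label; leave the rest unchanged."""
    return [MERGED_LABEL.get(label, label) for label in labels]

# Toy label list: balanced main classes, very little boredom.
toy_labels = (["anger"] * 100 + ["disgust"] * 100 + ["sadness"] * 100
              + ["boredom"] * 5 + ["happy"] * 100)
counts_before = Counter(toy_labels)
counts_after = Counter(merge_labels(toy_labels))
```

Merging two well-populated classes roughly doubles the size of the combined class relative to the others, so some form of rebalancing (class weights or resampling) would be needed to avoid the bias described above.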
Final Prediction Pipeline
The final prediction pipeline is depicted schematically in Fig. 7 below.
Figure 7 Final prediction pipeline
8. Conclusions and Future Scope
Through this project, we showed how machine learning can be leveraged to obtain the underlying emotion from speech audio data, and gained some insights into the human expression of emotion through voice. This system can be employed in a variety of setups, such as call centres handling complaints or marketing, voice-based virtual assistants or chatbots, linguistic research, etc.
A few possible steps that can be implemented to make the models more robust and accurate are the following:
An accurate implementation of the pace of the speaking can be explored to check if it can resolve some of the deficiencies of the model.
Figuring out a way to clear random silence from the audio clip.
Exploring other acoustic features of sound data to check their applicability in the domain of speech emotion recognition. These features could simply be some proposed extensions of MFCC like RAS-MFCC or they could be other features entirely like LPCC, PLP or Harmonic cepstrum.
Following lexical features based approach towards SER and using an ensemble of the lexical and acoustic models. This will improve the accuracy of the system because in some cases the expression of emotion is contextual rather than vocal.
Adding more data volume either by other augmentation techniques like time-shifting or speeding up/slowing down the audio or simply finding more annotated audio clips.
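One of the augmentation techniques suggested in the last item above, time-shifting, can be sketched as a circular shift of the sample array. This is only one possible variant (another would pad with silence instead of wrapping around), and the function name is hypothetical.

```python
def time_shift(samples, shift):
    """Circularly shift a signal: positive shift moves it later, negative earlier."""
    if not samples:
        return []
    shift %= len(samples)
    if shift == 0:
        return list(samples)
    return samples[-shift:] + samples[:-shift]

signal = [1, 2, 3, 4, 5]
delayed = time_shift(signal, 2)     # last two samples wrap to the front
advanced = time_shift(signal, -1)   # first sample wraps to the end
```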
The authors wish to express their gratitude to Paulami Das, Head of Data Science CoE @ Brillio and Anish Roychowdhury, Senior Analytics Leader @ Brillio for their mentoring and guidance towards shaping up this study.