Experiment results and analysis - Auditory-based categorical emotion recognition

Chapter 3 Auditory-based categorical emotion recognition

3.3.4 Experiment results and analysis

1) Comparison Experiments

Three comparison models were evaluated: 3D CNN, 3D CLSTM, and 3D CRNN-sv. All of them shared the same layers from conv1 to pool2, with an output shape of 300x160.

3D CNN: Two extra 2D convolutional layers and a pooling layer (with a 2x2 kernel and 2x2 stride) were added on top of pool2, followed by a fully connected layer.

3D CLSTM: Similar to the 3D CRNN model, but without the RNN1 and MV layers. For RNN2, the whole sequence with the shape of 300x160 was fed into the LSTM model, and max-pooling was used to generate 128 feature sequences.

3D CRNN-sv: A single-view variant of the 3D CRNN model, identical to it except that the MV layer was removed. The output size of FC was 128.

2) Experiment results

To train the models in a speaker-independent manner, we used leave-one-session-out cross-validation. In each fold, utterances from eight speakers formed the training set, and the remaining two speakers were used for testing.
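The leave-one-session-out protocol can be sketched as follows. This is a minimal illustration, assuming five sessions with two speakers each (consistent with the eight/two speaker split described above). The session and speaker identifiers are hypothetical placeholders, not the corpus's actual labels.

```python
# Hypothetical layout: 5 sessions, 2 speakers per session (10 speakers total).
SESSIONS = {
    f"session{i}": [f"spk{i}_F", f"spk{i}_M"] for i in range(1, 6)
}


def leave_one_session_out(sessions):
    """Yield (held_out_session, train_speakers, test_speakers) for each fold."""
    for held_out in sessions:
        test_speakers = list(sessions[held_out])
        train_speakers = [
            spk
            for name, speakers in sessions.items()
            if name != held_out
            for spk in speakers
        ]
        yield held_out, train_speakers, test_speakers


folds = list(leave_one_session_out(SESSIONS))
for held_out, train, test in folds:
    # No speaker appears on both sides, so each fold is speaker-independent.
    assert not set(train) & set(test)
    print(held_out, len(train), "train speakers,", len(test), "test speakers")
```

Holding out a whole session rather than random utterances is what keeps the test speakers unseen during training.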

We used two measures to evaluate the performance: weighted accuracy (WA) and unweighted accuracy (UA). WA is the classification accuracy over the entire test set, and UA is the average of the per-emotion classification accuracies. The results obtained for each method are shown in Fig. 5. They show that the multi-view 3D CRNN achieved the highest recognition accuracy, with 61.98% WA and 60.93% UA. This suggests that the multi-view model captured more multiscale information. The results also show that the 3D CNN had poorer accuracy than the other models because of the absence of a recurrent layer, which demonstrates the importance of sequential dependency information for emotion recognition from speech.

Table 3.2: Setup for modulation spectral features
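The two evaluation measures, WA and UA, can be computed as in the sketch below. The function names and the toy labels are illustrative assumptions and are not data from the experiments.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """WA: fraction of all test utterances classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """UA: per-emotion accuracy, averaged with equal weight per emotion."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy, made-up labels: 6 neutral and 2 angry utterances.
y_true = ["neu"] * 6 + ["ang"] * 2
y_pred = ["neu"] * 6 + ["neu", "ang"]

wa = weighted_accuracy(y_true, y_pred)    # 7 of 8 utterances correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of 6/6 and 1/2
print(f"WA = {wa:.3f}, UA = {ua:.3f}")
```

In this toy case the minority emotion is recognized poorly, so UA falls below WA; reporting both guards against a model that favors the most frequent emotion.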

Table 3.3 shows that the proposed method outperformed the other methods. Han et al. [64] first extracted segment-level emotion state distributions with a DNN model using F0 and MFCC features, and then used an ELM to identify utterance-level emotions. Chernykh et al. [63] proposed an RNN-based CTC approach that recognizes utterance-level emotions from MFCCs and spectral properties such as flux and roll-off. The method of Ghosh et al. [70] learns utterance-specific representations with a combination of stacked autoencoders and a bidirectional LSTM trained on 128-bin FFT spectrograms. Overall, the proposed approach significantly outperforms the previous best results, with absolute improvements of 5.88% in WA (from 56.1% to 61.98%) and 6.93% in UA (from 54% to 60.93%).

Table 3.3: Comparison of the proposed method and other methods on IEMOCAP database

Method                 Features            Models              WA       UA
Han et al. [64]        MFCC and F0         DNN-ELM             54.3%    48.2%
Ghosh et al. [70]      FFT spectrograms    BLSTM-autoencoder   48.1%    49.09%
Chernykh et al. [63]   MFCC and spectrum   RNN with CTC        54%      54%
Neumann et al. [71]    13 MFCCs            Attentive CNN       56.1%    -
Our work               Auditory features   3D CRNN             61.98%   60.93%
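The absolute improvements quoted in the text can be re-derived from the comparison table. The short check below copies the reported percentages; the dictionary layout and helper name are illustrative assumptions.

```python
# (WA, UA) in percent, copied from the comparison table; None marks "not reported".
results = {
    "Han et al. [64]": (54.3, 48.2),
    "Ghosh et al. [70]": (48.1, 49.09),
    "Chernykh et al. [63]": (54.0, 54.0),
    "Neumann et al. [71]": (56.1, None),
}
ours = (61.98, 60.93)


def best_prior(metric_index):
    """Highest previously reported value for one metric, skipping missing entries."""
    return max(
        scores[metric_index]
        for scores in results.values()
        if scores[metric_index] is not None
    )


wa_gain = round(ours[0] - best_prior(0), 2)
ua_gain = round(ours[1] - best_prior(1), 2)
print(f"WA gain: {wa_gain} points, UA gain: {ua_gain} points")
```

Note that the best prior WA and the best prior UA come from different methods, so the two gains are measured against different baselines.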

General discussion

In the two-stage method, we set an energy threshold and filtered out the segments whose energy was below it. As a result, only about 25.7% of the segments were retained to train the model. We also tried lowering the threshold to increase the amount of training data, but found that the additional low-energy segments had little effect on emotion recognition. After energy-based filtering, some utterances have 11 segments, whereas others are filtered out completely because of their low energy. Each speaker in the CASIA database has 2400 utterances, yet emotion could be recognized for only about 60% of them when they were used for testing. This shows that the energy-based two-stage method has obvious defects. In the end-to-end emotion recognition experiment, each utterance is processed by a soft segmentation method so that no data are filtered out, and max-pooling is then used to automatically capture the salient parts of the speech.
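The contrast between energy-based filtering and max-pooling can be illustrated with a toy sketch. The segment length, the threshold, the signal values, and the use of absolute sample values as stand-in "features" are all hypothetical choices made for illustration. They are not the settings or the learned features used in the experiments.

```python
def segment_energies(signal, seg_len):
    """Split a signal into non-overlapping segments; return (segment, mean energy) pairs."""
    pairs = []
    for start in range(0, len(signal) - seg_len + 1, seg_len):
        seg = signal[start:start + seg_len]
        pairs.append((seg, sum(x * x for x in seg) / seg_len))
    return pairs


def energy_filter(signal, seg_len, threshold):
    """Two-stage style: discard segments whose mean energy is below the threshold."""
    return [seg for seg, e in segment_energies(signal, seg_len) if e >= threshold]


def max_pool_over_segments(features):
    """End-to-end style: element-wise maximum over all segment-level feature vectors."""
    return [max(column) for column in zip(*features)]


# Made-up toy utterances (values are illustrative only).
loud = [0.9, -0.8, 0.7, -0.9, 0.1, 0.0, -0.1, 0.05]
quiet = [0.05, -0.02, 0.01, 0.03, -0.04, 0.02, 0.0, 0.01]

SEG_LEN, THRESHOLD = 4, 0.1  # hypothetical settings

kept_loud = energy_filter(loud, SEG_LEN, THRESHOLD)
kept_quiet = energy_filter(quiet, SEG_LEN, THRESHOLD)
print(len(kept_loud), len(kept_quiet))  # the quiet utterance loses every segment

# With max-pooling, every segment contributes and no utterance is discarded.
seg_features = [[abs(x) for x in seg] for seg, _ in segment_energies(quiet, SEG_LEN)]
pooled = max_pool_over_segments(seg_features)
print(pooled)
```

The quiet utterance is dropped entirely by the filter but still yields a pooled representation, which mirrors the defect of the two-stage method described above.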

Second, the two-stage method only uses the auditory filter and does not consider the spectral-temporal modulation cues, which are more important for speech perception. In the end-to-end method, a modulation filterbank is introduced to generate high-resolution spectral-temporal modulation cues from the time-domain envelope and its modulation frequency components. These cues contain multi-dimensional information. Therefore, an end-to-end 3D CRNN is designed to extract a high-level emotion feature sequence from the spectral-temporal representation and to model the temporal dependencies of that sequence. This chapter studied auditory-inspired end-to-end recognition of emotional speech using a 3D CRNN model based on temporal modulation cues. Convolutional networks can reconstruct multiscale spectral-temporal representations, and recurrent networks can capture the long-term dependencies needed for emotion recognition. The experimental results demonstrate that mimicking the human auditory system is an effective way to design an emotion recognition system.
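The idea of a temporal envelope and its modulation frequency components can be illustrated with a simplified sketch. It does not reproduce the Gammatone or modulation filterbank of this chapter: as a stand-in, it rectifies and smooths an amplitude-modulated tone to obtain the envelope, and then probes the envelope at a few modulation frequencies with single-frequency DFT components. The sampling rate, carrier, modulation rate, and probe frequencies are all hypothetical.

```python
import cmath
import math

FS = 4000        # sampling rate in Hz (hypothetical)
CARRIER = 500    # carrier frequency in Hz (hypothetical)
MOD = 8          # true modulation frequency in Hz (hypothetical)
N = FS           # one second of signal

# Amplitude-modulated tone standing in for the output of one auditory channel.
signal = [
    (1 + 0.5 * math.cos(2 * math.pi * MOD * n / FS))
    * math.cos(2 * math.pi * CARRIER * n / FS)
    for n in range(N)
]


def temporal_envelope(x, window):
    """Rectify, then smooth with a causal moving average (a crude low-pass filter)."""
    env, running = [], 0.0
    for n, value in enumerate(x):
        running += abs(value)
        if n >= window:
            running -= abs(x[n - window])
        env.append(running / min(n + 1, window))
    return env


def modulation_magnitude(env, freq, fs):
    """Magnitude of the mean-removed envelope's DFT component at `freq` Hz."""
    mean = sum(env) / len(env)
    acc = sum(
        (e - mean) * cmath.exp(-2j * math.pi * freq * n / fs)
        for n, e in enumerate(env)
    )
    return abs(acc) / len(env)


envelope = temporal_envelope(signal, window=FS // CARRIER)  # one carrier period
probe_freqs = [2, 4, 8, 16, 32, 64]  # hypothetical modulation channels, in Hz
magnitudes = {f: modulation_magnitude(envelope, f, FS) for f in probe_freqs}
dominant = max(magnitudes, key=magnitudes.get)
print(dominant)  # the probe at the true modulation rate dominates
```

Repeating this per auditory channel and per time window would give a channel-by-modulation-frequency-by-time array, which is the kind of multi-dimensional input a 3D convolutional front end can operate on.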

Summary

In this chapter, we first investigated two-stage emotion recognition from the multichannel acoustic frequency components of a Gammatone filterbank. Because part of the corpus is not used in training the two-stage model, and temporal envelope modulation is not considered, we proposed end-to-end emotion recognition using a 3D CRNN model based on temporal modulation cues. Convolutional networks are used to learn joint spectral-temporal representations from the temporal modulation cues, and recurrent networks are used to capture long-term dependencies for emotion recognition. The experimental results demonstrate that the proposed methods can effectively identify emotional states by mimicking the human auditory system.

However, to reduce the training cost, the end-to-end method segments the speech sequence into non-overlapping subsequences through soft segmentation. These discontinuous segment-level features cannot fully reflect the dynamic changes of emotions. Therefore, how to effectively simulate the auditory system to capture salient emotion regions remains an important issue in SER. Additionally, the modulation frequency components in the end-to-end method only include local information about variations in intensity and duration. Periodicity information is also effective for emotion recognition, so whether it can be extracted from temporal modulation cues remains an open question.

In document: JAIST Repository https://dspace.jaist.ac.jp/ (pages 60-63)