Using Auxiliary Tasks In Multimodal Fusion Of Wav2vec 2.0 And BERT For Multimodal Emotion Recognition

Dekai Sun; Yancheng He; Jiqing Han

arXiv:2302.13661·cs.CL·February 28, 2023

TL;DR

This paper enhances multimodal emotion recognition by integrating pretrained wav2vec 2.0 and BERT models with auxiliary tasks and a multi-head attention fusion mechanism, significantly improving accuracy on the IEMOCAP dataset.

Contribution

It introduces auxiliary tasks to improve modality fusion and leverages pretrained models for better feature extraction in MER.

Findings

1. Achieved 78.42% WA and 79.71% UA on IEMOCAP
2. Improved over previous state-of-the-art models
3. Demonstrated effectiveness of auxiliary tasks in multimodal fusion

Abstract

The lack of data and the difficulty of multimodal fusion have always been challenges for multimodal emotion recognition (MER). In this paper, we propose to use pretrained models as the upstream networks, wav2vec 2.0 for the audio modality and BERT for the text modality, and to finetune them on the downstream MER task to cope with the lack of data. To address the difficulty of multimodal fusion, we use a K-layer multi-head attention mechanism as the downstream fusion module. Starting from the MER task itself, we design two auxiliary tasks to alleviate insufficient fusion between modalities and to guide the network to capture and align emotion-related features. Compared to previous state-of-the-art models, we achieve better performance, with 78.42% Weighted Accuracy (WA) and 79.71% Unweighted Accuracy (UA) on the IEMOCAP dataset.

Tables

Table 1: Weighted Accuracy (WA) and Unweighted Accuracy (UA) of the 5-fold CV results using a single modality and multiple modalities. (FC: Fully Connected; CA: Multi-Head Cross Attention (K=1); Aux1: Auxiliary Task 1; Aux2: Auxiliary Task 2.)

| Modality | Methods | WA (%) | UA (%) |
|---|---|---|---|
| Text-only | BERT | 70.53 | 71.79 |
| Audio-only | Wav2vec2 | 69.92 | 70.68 |
| Audio and Text | BERT+Wav2vec2+FC | 76.24 | 77.20 |
| Audio and Text | BERT+Wav2vec2+CA | 77.19 | 78.47 |
| Audio and Text | BERT+Wav2vec2+CA+Aux1 | 77.67 | 79.16 |
| Audio and Text | BERT+Wav2vec2+CA+Aux2 | 78.11 | 79.47 |
| Audio and Text | BERT+Wav2vec2+CA+Aux1&2 | 78.34 | 79.59 |

Table 2: Performance with different K (the number of layers of Multi-Head Cross Attention (CA)).

| Methods | K | WA (%) | UA (%) |
|---|---|---|---|
| BERT+Wav2vec2+CA+Aux1&2 | 1 | 78.34 | 79.59 |
| BERT+Wav2vec2+CA+Aux1&2 | 2 | 78.42 | 79.71 |
| BERT+Wav2vec2+CA+Aux1&2 | 3 | 77.68 | 79.41 |

Table 3: Comparison of the 5-fold CV results of previous state-of-the-art multimodal models and our model on IEMOCAP.

| Methods | WA (%) | UA (%) |
|---|---|---|
| BERT + Wav2vec2 [11] | - | 76.31 |
| RoBERTa-text&audio [10] | 77.70 | 78.50 |
| BERT + FBK [13] | 77.57 | 78.41 |
| SMCN [14] | 75.60 | 77.60 |
| BERT + FBK [19] | 70.56 | 71.46 |
| MCSAN [12] | 61.20 | 56.00 |
| Our proposed (best) | 78.42 | 79.71 |

Equations

$$F_{a} = F_{a} + \mathrm{Attention}_{at}(Q_{a}, K_{t}, V_{t})$$

$$\mathrm{Attention}_{at}(Q_{a}, K_{t}, V_{t}) = \mathrm{softmax}\left(\frac{Q_{a} K_{t}^{T}}{d_{K_{t}}}\right) V_{t}$$

$$F_{t} = F_{t} + \mathrm{Attention}_{ta}(Q_{t}, K_{a}, V_{a})$$

$$\mathrm{Attention}_{ta}(Q_{t}, K_{a}, V_{a}) = \mathrm{softmax}\left(\frac{Q_{t} K_{a}^{T}}{d_{K_{a}}}\right) V_{a}$$

$$Q_{a} = W_{Q} F_{a} + b_{a}^{Q}$$

$$K_{a} = W_{K} F_{a} + b_{a}^{K}$$

$$V_{a} = W_{V} F_{a} + b_{a}^{V}$$

$$label_{original} = label_{a} = label_{t}$$

$$label_{new} = label_{a} \times emotion\_nums + label_{t}$$

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Dense Connections · Weight Decay · Dropout · WordPiece · Attention Dropout · Softmax · Linear Warmup With Linear Decay · Adam

Full text

**Index Terms:** Multimodal emotion recognition, BERT, wav2vec 2.0, cross-attention, auxiliary task.

1 Introduction

Multimodal emotion recognition is an important capability in human-machine interaction and has attracted widespread attention in industry and academia. Emotions are expressed in complex and ambiguous ways, for example through linguistic content, speech intonation, facial expressions, and body movements. There have been many studies on text emotion recognition [1, 2] and on audio emotion recognition [3, 4, 5]. However, these results suggest that single-modality research has reached a bottleneck, which has led to increasing attention on multimodal approaches. Some studies propose that the information in different modalities is often complementary and mutually corroborating, and that making full use of it helps the model learn the key content [6, 7].

In recent years, pretrained self-supervised learning has performed prominently in several research fields such as natural language processing (NLP) [8] and automatic speech recognition (ASR) [9]. For the multimodal emotion recognition (MER) task, several studies have also explored the use of pretrained models. Siriwardhana et al. [5] were the first to jointly finetune modality-specific “BERT-like” pretrained self-supervised learning (SSL) architectures to represent both audio and text modalities for MER. Similarly, Yang et al. [10] proposed to finetune two pretrained self-supervised learning models (Text-RoBERTa and Speech-RoBERTa) for MER. Based on pretrained models, Zhao et al. [11] explored multi-level fusion approaches, including coattention-based early fusion and late fusion with the models trained on both embeddings. Compared with MCSAN [12], which uses traditional features (MFCC and GloVe) for modal fusion, the works mentioned above greatly improve performance. To make full use of contextual data, Wu et al. [13] exploited contextual information and proposed a two-branch neural network structure with a time-synchronous branch and a time-asynchronous branch. By modifying the network structure, SMCN [14] realizes multimodal alignment that can capture global connections without interfering with unimodal learning. However, these previous works focused on sophisticated fusion structure design, on larger and stronger pretrained models, or on contextual information that breaks data constraints. They did not start from the MER task itself to explore the bottleneck of insufficient fusion, nor did they capture the features of emotion itself and the alignment of emotion across modalities. We believe that the network already has sufficient parameters and that complex fusion module designs have not brought enough benefit. Thus, we aim to guide the model to fully exploit the potential of the fusion module by designing appropriate auxiliary tasks.

In this work, we propose a modular end-to-end approach for the MER task. The general framework is shown in Figure 1. First, we learn the semantic information of each modality through pretrained models, wav2vec 2.0 [9] for the audio modality and BERT [8] for the text modality. Then, we map the text and audio features into a unified semantic vector space through a K-layer cross-attention mechanism for more adequate modal fusion. Furthermore, we design two auxiliary tasks to help fully fuse the features of the two modalities and learn the alignment of emotion itself across modalities. In the first task, we randomly recombine the text and audio modalities and let the model predict the combination of the two modalities from the fused vector. This decoupling of multimodal data exposes the model to more complex input combinations, and the constraint of this auxiliary task forces the network not to ignore either modality in the MER task. In the second task, we randomly replace one of the modalities with other data of the same emotion category, so that the model can capture emotion-related features and alignment information beyond the content itself.

We comprehensively evaluated the performance of the proposed model on the IEMOCAP dataset in terms of average weighted accuracy (WA) and unweighted accuracy (UA). In addition, we compared it with the SOTA methods and present ablation experiments that illustrate the effectiveness of each module.

2 Method

The framework of our proposed model is shown in Figure 1. It consists of three modules: a text encoder, an audio encoder, and a fusion module.

2.1 Text Encoder

The emergence of BERT brought NLP into a new era and steadily improved the state of the art on many NLP tasks, and “pretrain + finetune” has become a standard paradigm. Pretrained models such as BERT can transform text into word vectors that carry contextual semantic information. In this paper, we choose bert-base-uncased (https://huggingface.com/bert-base-uncased) as the text encoder, which consists of 12 transformer encoder layers. It converts the text into 768-dimensional vectors, which are fed into the fusion module. During training, we also finetune its weights to make it more suitable for our multimodal emotion recognition task.

2.2 Audio Encoder

We choose wav2vec2-base (https://huggingface.co/facebook/wav2vec2-base) as the audio encoder, which consists of a feature encoder, a Transformer-based contextualized representation network, and a quantization module. The base model contains 12 transformer blocks and is pretrained on the LibriSpeech corpus, which contains 960 hours of 16 kHz speech. It learns a 768-dimensional latent representation directly from raw audio every 20 ms (at a 16 kHz sampling rate). As with BERT, we also finetune its parameters during training.

2.3 Fusion Module

The fusion module is based on the multi-head cross-attention mechanism [15]. In addition, two auxiliary tasks (Section 2.4) help the model better handle the feature relationship between the two modalities. Figure 3 shows the details of the fusion module. Each layer consists of two branches, which have the same structure but different Q, K, and V. We also use residual connections to reduce the loss of information from the original modalities. The multi-head cross attention is computed as follows:

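$$F_{a} = F_{a} + \mathrm{Attention}_{at}(Q_{a}, K_{t}, V_{t})$$

$$\mathrm{Attention}_{at}(Q_{a}, K_{t}, V_{t}) = \mathrm{softmax}\left(\frac{Q_{a} K_{t}^{T}}{d_{K_{t}}}\right) V_{t}$$

$$F_{t} = F_{t} + \mathrm{Attention}_{ta}(Q_{t}, K_{a}, V_{a})$$

$$\mathrm{Attention}_{ta}(Q_{t}, K_{a}, V_{a}) = \mathrm{softmax}\left(\frac{Q_{t} K_{a}^{T}}{d_{K_{a}}}\right) V_{a}$$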

where subscript $a$ denotes the audio modality and subscript $t$ denotes the text modality. $d_{K_{a}}$ and $d_{K_{t}}$ are the dimensions of the embeddings. $F_{t}$ : $(B,T_{t},C)$ is the text feature output by BERT, and $F_{a}$ : $(B,T_{a},C)$ is the audio feature output by wav2vec 2.0. $Q_{a}$ , $K_{a}$ , $V_{a}$ are given below ( $Q_{t}$ , $K_{t}$ , $V_{t}$ are defined in the same way):

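$$Q_{a} = W_{Q} F_{a} + b_{a}^{Q}$$

$$K_{a} = W_{K} F_{a} + b_{a}^{K}$$

$$V_{a} = W_{V} F_{a} + b_{a}^{V}$$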

Finally, we apply average pooling to $F_{a}$ and $F_{t}$ along the time dimension and concatenate the results along the feature dimension to obtain the fusion embedding $(B,2C)$ , which is sent to the classifier to predict the emotion category.
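The fusion steps described in this section can be sketched in plain Python. This is a minimal illustration under several assumptions that the text does not confirm: a single attention head (the experiments use 8), no batch dimension, separate projection weights for each branch, both branches reading the layer inputs before the residual update, and a row-vector convention (`f @ w + b`). The helper names and the toy identity weights are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def project(f, w, b):
    """Affine map applied to every time step: f @ w + b."""
    return [[v + bias for v, bias in zip(row, b)] for row in matmul(f, w)]

def cross_attention(f_query, f_context, proj):
    """Queries come from one modality, keys and values from the other."""
    q = project(f_query, *proj["q"])
    k = project(f_context, *proj["k"])
    v = project(f_context, *proj["v"])
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    # Scores are divided by d_k, following the equations as written in this
    # text; standard scaled dot-product attention divides by sqrt(d_k).
    scores = [[s / d_k for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)

def add(f, g):
    """Element-wise sum of two equally shaped matrices (the residual link)."""
    return [[x + y for x, y in zip(r, s)] for r, s in zip(f, g)]

def fusion_layer(f_a, f_t, proj_at, proj_ta):
    """One layer: both branches read the layer inputs, then add a residual."""
    new_a = add(f_a, cross_attention(f_a, f_t, proj_at))
    new_t = add(f_t, cross_attention(f_t, f_a, proj_ta))
    return new_a, new_t

def mean_pool(f):
    """Average over the time dimension: (T, C) -> (C,)."""
    return [sum(col) / len(f) for col in zip(*f)]

def fuse(f_a, f_t, layers):
    """Run K fusion layers, then pool and concatenate into a 2C vector."""
    for proj_at, proj_ta in layers:
        f_a, f_t = fusion_layer(f_a, f_t, proj_at, proj_ta)
    return mean_pool(f_a) + mean_pool(f_t)

def identity_proj(c):
    """Hypothetical toy projection: identity weights and zero biases."""
    eye = [[1.0 if i == j else 0.0 for j in range(c)] for i in range(c)]
    return {name: (eye, [0.0] * c) for name in ("q", "k", "v")}

f_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # T_a = 3 audio frames, C = 2
f_t = [[2.0, 0.0], [0.0, 2.0]]              # T_t = 2 text tokens, C = 2
layers = [(identity_proj(2), identity_proj(2)) for _ in range(2)]  # K = 2
embedding = fuse(f_a, f_t, layers)
print(len(embedding))  # 4, i.e. 2C
```

A practical implementation would use batched tensors and a multi-head attention layer from a deep learning framework; the nested lists here only make the data flow explicit.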

2.4 Auxiliary Tasks

In order to help the model fully fuse the features of the two modalities and learn the alignment information of the emotion itself between different modalities, we design two auxiliary modal interaction tasks.

2.4.1 Auxiliary Task1

In MER tasks, the audio and text of an utterance carry the same semantics. We attribute insufficient fusion in the downstream network to the fact that the overall emotional orientation can often be obtained from the information of a single modality. In some cases, relying on one modality leads to the right result. For complex cases, however, we want the network to be more “humble” and make full use of the information in both modalities. As shown in Figure 2, we decouple the {Audio, Text} pairs in a batch of data, and then randomly shuffle and recombine them to get Aux_batch1. During training, the model predicts not only the emotion category of the original data pair but also the combined category of the recombined {Audio, Text} pair ( $emotion\_nums\times emotion\_nums$ kinds in total, where $emotion\_nums$ is the number of emotion categories). Its label ( $label_{new}$ ) is defined as follows:

$$label_{original} = label_{a} = label_{t}$$

$$label_{new} = label_{a} \times emotion\_nums + label_{t}$$

The main MER task requires the downstream network to receive the features from the two modalities and output the emotion category, while Auxiliary Task 1 requires the downstream network to predict not only the emotion but also the combination of the two modalities from the fusion embedding. This forces the downstream network not to ignore either modality during feature fusion, so that both modalities contribute to the final fusion embedding.
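The recombination and labeling for Auxiliary Task 1 can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name `make_aux_batch1`, the choice to shuffle only the text side, and the toy batch of string placeholders are assumptions, and the emotion ordering in `EMOTIONS` is hypothetical.

```python
import random

def make_aux_batch1(batch, emotion_nums, rng):
    """Recombine audio and text across a batch and build combined labels.

    batch: list of (audio, text, label) triples whose audio and text share
    one emotion label. Returns (audio, text, label_new) triples with
    label_new = label_a * emotion_nums + label_t.
    """
    texts = [(text, label) for _, text, label in batch]
    rng.shuffle(texts)  # shuffle only the text side so pairs get recombined
    aux_batch = []
    for (audio, _, label_a), (text, label_t) in zip(batch, texts):
        aux_batch.append((audio, text, label_a * emotion_nums + label_t))
    return aux_batch

# Hypothetical toy batch; strings stand in for waveforms and transcripts.
EMOTIONS = ["happy", "angry", "sad", "neutral"]
batch = [("a0", "t0", 0), ("a1", "t1", 1), ("a2", "t2", 2), ("a3", "t3", 3)]
aux_batch1 = make_aux_batch1(batch, len(EMOTIONS), random.Random(0))
for audio, text, label_new in aux_batch1:
    # divmod inverts the encoding and recovers both source labels
    label_a, label_t = divmod(label_new, len(EMOTIONS))
    print(audio, text, label_new, EMOTIONS[label_a], EMOTIONS[label_t])
```

Because the encoding is a base-`emotion_nums` number, each of the 16 possible (audio emotion, text emotion) combinations for four classes maps to a distinct class index.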

2.4.2 Auxiliary Task2

To guide the fusion network to learn the alignment of emotion itself across modalities, we break the strong semantic correlation between modalities. As shown in Figure 2, for the {Audio, Text} pairs in a batch of data, we randomly replace one of the modalities (Audio or Text) with other data of the same emotion category. In Aux_batch2, the two modalities have the same emotion label but different semantics. We expect the fusion network to focus on the features of emotion itself in the different modalities and align them. At the same time, the model can better learn the common features of each emotion category.
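The construction of Aux_batch2 can be sketched in the same style. The text does not say where the replacement data comes from or how the modality to replace is chosen, so the donor pool `pool_by_label`, the 50/50 choice between audio and text, and the function name are assumptions made for illustration.

```python
import random

def make_aux_batch2(batch, pool_by_label, rng):
    """Swap one modality of each pair for a same-emotion donor sample.

    batch: list of (audio, text, label) triples.
    pool_by_label: dict mapping label -> list of (audio, text) donor pairs.
    The emotion label of each pair is unchanged; only the content differs.
    """
    aux_batch = []
    for audio, text, label in batch:
        donor_audio, donor_text = rng.choice(pool_by_label[label])
        if rng.random() < 0.5:
            aux_batch.append((donor_audio, text, label))  # replace the audio
        else:
            aux_batch.append((audio, donor_text, label))  # replace the text
    return aux_batch

# Hypothetical toy data; strings stand in for waveforms and transcripts.
pool_by_label = {
    0: [("a_happy_1", "t_happy_1"), ("a_happy_2", "t_happy_2")],
    1: [("a_angry_1", "t_angry_1"), ("a_angry_2", "t_angry_2")],
}
batch = [("a_happy_0", "t_happy_0", 0), ("a_angry_0", "t_angry_0", 1)]
aux_batch2 = make_aux_batch2(batch, pool_by_label, random.Random(0))
for sample in aux_batch2:
    print(sample)
```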

3 Experimental setup

3.1 Dataset

The dataset used in the experiments is the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus [16], a dialogue dataset of improvised and scripted sessions performed by 10 actors. The 10 actors are divided into 5 sessions, and each session has 1 male and 1 female speaker. There are a total of 7,529 utterances in IEMOCAP (happy 595, excited 1,041, angry 1,103, sad 1,084, neutral 1,708, frustration 1,849, fear 40, surprise 107, disgust 2). To be consistent and comparable with previous studies [17], only utterances with ground truth labels belonging to “angry”, “happy”, “excited”, “sad”, and “neutral” were used. The “excited” class was merged with “happy” to better balance the size of each emotion class, which results in a total of 5,531 utterances (happy 1,636, angry 1,103, sad 1,084, neutral 1,708).
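The class filtering and merging above can be checked with a few lines of Python. The counts are the ones quoted in this section; the function name `merge_classes` is hypothetical.

```python
from collections import Counter

RAW_COUNTS = {
    "happy": 595, "excited": 1041, "angry": 1103, "sad": 1084,
    "neutral": 1708, "frustration": 1849, "fear": 40, "surprise": 107,
    "disgust": 2,
}
KEPT = {"angry", "happy", "excited", "sad", "neutral"}

def merge_classes(raw_counts):
    """Drop unused classes and fold "excited" into "happy"."""
    merged = Counter()
    for emotion, n in raw_counts.items():
        if emotion not in KEPT:
            continue
        merged["happy" if emotion == "excited" else emotion] += n
    return dict(merged)

merged = merge_classes(RAW_COUNTS)
print(sum(RAW_COUNTS.values()))  # 7529
print(merged)  # {'happy': 1636, 'angry': 1103, 'sad': 1084, 'neutral': 1708}
print(sum(merged.values()))      # 5531
```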

3.2 Implementation Details

To fully evaluate our proposed model and maintain the same test conditions as previous studies [13], we use a leave-one-session-out 5-fold cross-validation (CV) configuration. We divide IEMOCAP into five folds according to sessions. In each fold, one session is held out for testing and the other four sessions are used for training. Each fold therefore yields one result, and we report the average over the five folds as the final result.
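The leave-one-session-out protocol can be sketched as follows. The record layout (a dict with a `session` key), the `evaluate` callback, and the toy data are assumptions for illustration; in a real experiment `evaluate` would train a model on `train` and return its score on `test`.

```python
def leave_one_session_out(utterances, sessions=(1, 2, 3, 4, 5)):
    """Yield (held_out_session, train, test) splits, one per session."""
    for held_out in sessions:
        train = [u for u in utterances if u["session"] != held_out]
        test = [u for u in utterances if u["session"] == held_out]
        yield held_out, train, test

def cross_validate(utterances, evaluate):
    """Average a per-fold score over the five held-out sessions."""
    scores = [evaluate(train, test)
              for _, train, test in leave_one_session_out(utterances)]
    return sum(scores) / len(scores)

# Hypothetical toy data: 10 utterances, 2 per session.
utterances = [{"id": i, "session": i % 5 + 1} for i in range(10)]

def dummy_evaluate(train, test):
    """Placeholder score: the fraction of data held out in this fold."""
    return len(test) / (len(train) + len(test))

print(cross_validate(utterances, dummy_evaluate))
```

Splitting by session keeps both speakers of the held-out session out of the training data, so each fold is speaker-independent.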

We implement our model in the PyTorch framework and use the AdamW [18] optimizer with a learning rate of $1\times 10^{-5}$ . The cross-attention layers use 8 heads.

4 Results

Table 1 shows the performance of our method on audio-only, text-only, and multimodal (audio and text) emotion recognition tasks. Simply concatenating the features of the two modalities and feeding them into a downstream network built from a fully connected (FC) layer improves WA by about 6 percentage points over a single modality (76.24% versus 70.53% for BERT alone and 69.92% for wav2vec 2.0 alone). Further, using the single-layer (K=1) multi-head cross-attention downstream network in Figure 3 for modality fusion achieves WA : 77.19%, UA : 78.47%. On top of this configuration, we also verify the gains of Auxiliary Task 1 and Auxiliary Task 2, of which Auxiliary Task 2 yields the larger improvement. Using both auxiliary tasks together gives WA : 78.34%, UA : 79.59%.

Table 2 shows the effect of the number of multi-head cross-attention layers K on emotion recognition performance when both auxiliary tasks are used together. When K is 2, we get the best performance, WA : 78.42%, UA : 79.71%. We found that introducing the auxiliary tasks makes the overall training objective harder to achieve, and that appropriately increasing the number of layers in the downstream network yields better performance. However, because the IEMOCAP dataset is small, further increasing the number of layers makes it difficult to fully train the network parameters, and performance degrades at K=3. The performance of previous state-of-the-art multimodal models is listed in Table 3; our proposed method outperforms these previous works.
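The text reports WA and UA without defining them. The sketch below uses the definitions that are conventional in speech emotion recognition, which is an assumption about this paper: WA is the overall accuracy over all utterances, and UA is the mean of the per-class accuracies. The toy labels are hypothetical.

```python
from collections import defaultdict

def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class accuracies, so every class counts equally."""
    total = defaultdict(int)
    hit = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    return sum(hit[c] / total[c] for c in total) / len(total)

# Hypothetical toy predictions: 4 neutral and 2 sad utterances.
y_true = ["neutral"] * 4 + ["sad"] * 2
y_pred = ["neutral", "neutral", "neutral", "neutral", "sad", "neutral"]
print(weighted_accuracy(y_true, y_pred))    # 5 of 6 correct, about 0.83
print(unweighted_accuracy(y_true, y_pred))  # (1.0 + 0.5) / 2 = 0.75
```

On imbalanced data the two metrics diverge, as in the toy example, which is why both are reported.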

5 Conclusion

In this paper, we propose to use wav2vec 2.0 and BERT as the upstream networks and a K-layer downstream network based on the multi-head cross-attention mechanism for the multimodal emotion recognition task. In addition, we design two auxiliary tasks that help the model fully integrate audio and text and capture and align the features of emotion itself in different modalities. Our method outperforms previous work on the 5-fold CV results of IEMOCAP and achieves state-of-the-art performance, WA : 78.42%, UA : 79.71%.

Bibliography

[1] Francisca Adoma Acheampong, Henry Nunoo-Mensah, and Wenyu Chen, “Comparative analyses of bert, roberta, distilbert, and xlnet for text-based emotion recognition,” in 2020 17th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP). IEEE, 2020, pp. 117–121.
[2] Francisca Adoma Acheampong, Henry Nunoo-Mensah, and Wenyu Chen, “Transformer models for text-based emotion detection: a review of bert-based approaches,” Artificial Intelligence Review, vol. 54, no. 8, pp. 5789–5829, 2021.
[3] Peipei Shen, Zhou Changjun, and Xiong Chen, “Automatic speech emotion recognition using support vector machine,” in Proceedings of 2011 international conference on electronic & mechanical engineering and information technology. IEEE, 2011, vol. 2, pp. 621–625.
[4] KV Krishna Kishore and P Krishna Satish, “Emotion recognition in speech using mfcc and wavelet features,” in 2013 3rd IEEE International Advance Computing Conference (IACC). IEEE, 2013, pp. 842–847.
[5] Shamane Siriwardhana, Andrew Reis, Rivindu Weerasekera, and Suranga Nanayakkara, “Jointly fine-tuning ‘bert-like’ self supervised models to improve multimodal speech emotion recognition,” arXiv preprint arXiv:2008.06682, 2020.
[6] Mohammad Soleymani, Maja Pantic, and Thierry Pun, “Multimodal emotion recognition in response to videos,” IEEE transactions on affective computing, vol. 3, no. 2, pp. 211–223, 2011.
[7] Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, and Dinesh Manocha, “M3ER: Multiplicative multimodal emotion recognition using facial, textual, and speech cues,” in Proceedings of the AAAI conference on artificial intelligence, 2020, vol. 34, pp. 1359–1367.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.