Multi-modal embeddings using multi-task learning for emotion recognition
Aparna Khare, Srinivas Parthasarathy, Shiva Sundaram

TL;DR
This paper introduces multi-modal embeddings generated via multi-task learning with transformer encoders, combining audio, visual, and textual data to enhance emotion recognition performance.
Contribution
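The paper extends general-purpose embeddings from natural language understanding to multi-modal architectures. A transformer encoder is trained jointly on person identification and automatic speech recognition, and the resulting embeddings are then tuned for emotion recognition.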
Findings
On the CMU-MOSEI dataset, the tuned embeddings improve over previous state-of-the-art emotion recognition results.
Multi-task training enhances the quality of multi-modal embeddings.
The approach effectively integrates audio, visual, and textual information.
Abstract
General embeddings like word2vec, GloVe and ELMo have shown a lot of success in natural language tasks. The embeddings are typically extracted from models that are built on general tasks such as skip-gram models and natural language generation. In this paper, we extend the work from natural language understanding to multi-modal architectures that use audio, visual and textual information for machine learning tasks. The embeddings in our network are extracted using the encoder of a transformer model trained using multi-task training. We use person identification and automatic speech recognition as the tasks in our embedding generation framework. We tune and evaluate the embeddings on the downstream task of emotion recognition and demonstrate that on the CMU-MOSEI dataset, the embeddings can be used to improve over previous state of the art results.
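The multi-task setup described in the abstract can be illustrated with a deliberately small sketch: one shared encoder produces an embedding, two task heads read from it, and the training objective is a weighted sum of the two task losses. This is not the authors' implementation. The single tanh layer stands in for the transformer encoder, the ASR task is reduced to predicting one token id per example (far simpler than real speech recognition), and all names, dimensions, and loss weights (`encode`, `multitask_loss`, `w_person`, `w_asr`) are hypothetical choices made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of the target class."""
    return -math.log(softmax(logits)[target])


def init_linear(rng, n_out, n_in):
    """Small random weights and zero bias for a dense layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(weights, bias, x):
    """Dense layer: weights @ x + bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def encode(encoder, audio, visual, text):
    """Shared encoder (ASSUMPTION: a tanh layer standing in for the transformer).

    Modalities are fused by simple concatenation, which is also an assumption.
    """
    fused = audio + visual + text
    weights, bias = encoder
    return [math.tanh(v) for v in linear(weights, bias, fused)]


def multitask_loss(params, example, w_person=0.5, w_asr=0.5):
    """Weighted sum of the two task losses computed from one shared embedding."""
    emb = encode(params["encoder"], example["audio"], example["visual"], example["text"])
    person_logits = linear(*params["person_head"], emb)
    asr_logits = linear(*params["asr_head"], emb)  # ASSUMPTION: one token per example
    loss_person = cross_entropy(person_logits, example["speaker_id"])
    loss_asr = cross_entropy(asr_logits, example["token_id"])
    return w_person * loss_person + w_asr * loss_asr, emb


# Toy usage with made-up feature sizes (4 audio, 3 visual, 5 text dimensions).
rng = random.Random(0)
params = {
    "encoder": init_linear(rng, 8, 4 + 3 + 5),
    "person_head": init_linear(rng, 6, 8),   # 6 hypothetical speakers
    "asr_head": init_linear(rng, 10, 8),     # 10 hypothetical tokens
}
example = {
    "audio": [rng.random() for _ in range(4)],
    "visual": [rng.random() for _ in range(3)],
    "text": [rng.random() for _ in range(5)],
    "speaker_id": 2,
    "token_id": 7,
}
loss, embedding = multitask_loss(params, example)
print(f"joint loss: {loss:.4f}, embedding size: {len(embedding)}")
```

In a setup like this, gradients from both heads would flow into the shared encoder during training. Afterwards the heads are discarded and the encoder output (`embedding` here) is what gets tuned on the downstream emotion recognition task.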
Taxonomy
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Bidirectional LSTM · Softmax · ELMo · GloVe Embeddings
