Multimodal Embeddings from Language Models
Shao-Yen Tseng, Panayiotis Georgiou, Shrikanth Narayanan

TL;DR
This paper introduces a multimodal language model that integrates audio and text to produce embeddings enriched with paralinguistic and affective information, improving emotion recognition performance.
Contribution
The paper presents a novel approach for incorporating acoustic data into pretrained language models, creating multimodal embeddings that improve emotion recognition.
Findings
Improved emotion recognition accuracy on the CMU-MOSEI dataset.
Multimodal embeddings capture paralinguistic and affective cues.
Outperforms previous state-of-the-art multimodal models.
Abstract
Word embeddings such as ELMo have recently been shown to model word semantics with greater efficacy through contextualized learning on large-scale language corpora, resulting in significant improvement in the state of the art across many natural language tasks. In this work, we integrate acoustic information into contextualized lexical embeddings through the addition of multimodal inputs to a pretrained bidirectional language model. The language model is trained on spoken language that includes text and audio modalities. The resulting representations from this model are multimodal and contain paralinguistic information which can modify word meanings and provide affective information. We show that these multimodal embeddings can be used to improve over previous state-of-the-art multimodal models in emotion recognition on the CMU-MOSEI dataset.
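Illustrative sketch
The abstract describes feeding multimodal (text plus audio) inputs to a bidirectional language model. The snippet below is only a minimal sketch of one way word-level inputs could be assembled, not the authors' method: it assumes each word comes with a span of acoustic frames, mean-pools those frames, and concatenates the result with the word's lexical vector. The function names, the mean pooling, the concatenation, and the toy values are all assumptions made for illustration; this summary does not state how the paper actually fuses the two modalities.

```python
from statistics import fmean


def pool_frames(frames, start, end):
    """Mean-pool the acoustic frames in [start, end) into a single vector.

    Mean pooling is an assumption made for this sketch, not the paper's encoder.
    """
    span = frames[start:end]
    if not span:
        raise ValueError("word span contains no acoustic frames")
    dim = len(span[0])
    return [fmean([frame[d] for frame in span]) for d in range(dim)]


def fuse_word_inputs(word_vectors, frames, spans):
    """Concatenate each lexical vector with its pooled acoustic vector.

    Concatenation is a hypothetical fusion choice used only for illustration.
    """
    if len(word_vectors) != len(spans):
        raise ValueError("need exactly one frame span per word")
    return [
        list(vec) + pool_frames(frames, start, end)
        for vec, (start, end) in zip(word_vectors, spans)
    ]


# Toy data: two words with 3-dimensional lexical vectors,
# and five acoustic frames with 2 features each.
word_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
spans = [(0, 2), (2, 5)]  # frame index range aligned to each word

fused = fuse_word_inputs(word_vectors, frames, spans)
for vector in fused:
    print(vector)
```

Under these assumptions, each fused vector has the lexical dimensions followed by the acoustic dimensions, and a sequence of such vectors could serve as the input to a bidirectional LSTM language model of the kind listed under Methods.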
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Emotion and Mood Recognition
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Bidirectional LSTM · Softmax · ELMo
