Multimodal Embeddings from Language Models

Shao-Yen Tseng; Panayiotis Georgiou; Shrikanth Narayanan

arXiv:1909.04302 · cs.CL · September 11, 2019 · 1 citation

TL;DR

This paper introduces a multimodal language model that integrates audio and text to produce embeddings enriched with paralinguistic and affective information, improving emotion recognition performance.

Contribution

The paper presents an approach for incorporating acoustic information into a pretrained bidirectional language model, yielding multimodal embeddings that improve emotion recognition.

Findings

01. Improves emotion recognition on the CMU-MOSEI dataset.
02. The multimodal embeddings capture paralinguistic and affective cues.
03. Outperforms previous state-of-the-art multimodal models on CMU-MOSEI emotion recognition.

Abstract

Word embeddings such as ELMo have recently been shown to model word semantics with greater efficacy through contextualized learning on large-scale language corpora, resulting in significant improvement in state of the art across many natural language tasks. In this work we integrate acoustic information into contextualized lexical embeddings through the addition of multimodal inputs to a pretrained bidirectional language model. The language model is trained on spoken language that includes text and audio modalities. The resulting representations from this model are multimodal and contain paralinguistic information which can modify word meanings and provide affective information. We show that these multimodal embeddings can be used to improve over previous state of the art multimodal models in emotion recognition on the CMU-MOSEI dataset.
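
The idea in the abstract, feeding both lexical and acoustic inputs to a language model and deriving an embedding from its layers, can be illustrated with a small sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the concatenation used as the fusion step is an assumption, the names `fuse_word_inputs` and `scalar_mix` are hypothetical, and the toy vectors stand in for real model outputs. The weighted combination of layers follows the standard ELMo formulation (softmax-normalized layer weights scaled by a factor gamma).

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_word_inputs(lexical: Vector, acoustic: Vector) -> Vector:
    """Hypothetical fusion step (an assumption): concatenate one word's
    lexical features with its acoustic features."""
    return lexical + acoustic


def scalar_mix(layers: List[Vector], layer_scores: Vector, gamma: float = 1.0) -> Vector:
    """ELMo-style embedding: gamma * sum_j softmax(layer_scores)[j] * layers[j]."""
    weights = softmax(layer_scores)
    dim = len(layers[0])
    return [
        gamma * sum(w * layer[i] for w, layer in zip(weights, layers))
        for i in range(dim)
    ]


# Toy values for a single word (illustrative only, not taken from the paper).
lexical = [0.2, -0.1]   # stand-in for a 2-d word embedding
acoustic = [0.7]        # stand-in for a 1-d summary of the word's audio
fused = fuse_word_inputs(lexical, acoustic)

# Stand-ins for the language model's per-layer outputs for this word.
layers = [fused, [0.0, 0.5, 0.5], [1.0, 1.0, 0.0]]
embedding = scalar_mix(layers, layer_scores=[0.0, 0.0, 0.0], gamma=1.0)
```

With equal layer scores the softmax weights are uniform, so `embedding` is simply the average of the three layer vectors. In a downstream task such as emotion recognition, the layer scores and gamma would typically be learned together with the classifier.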

Code & Models

Repositories

shaoyent/multimodal-elmo (TensorFlow)

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Emotion and Mood Recognition

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Bidirectional LSTM · Softmax · ELMo