CALM: Contrastive Aligned Audio-Language Multirate and Multimodal Representations

Vin Sachidananda; Shao-Yen Tseng; Erik Marchi; Sachin Kajarekar; Panayiotis Georgiou

arXiv:2202.03587 · eess.AS · February 9, 2022 · 5 cites

TL;DR

CALM introduces a contrastive, multirate approach to learn aligned audio and language representations, achieving competitive performance in emotion recognition with efficient training.

Contribution

The paper presents CALM, a novel method for aligning audio and lexical representations using contrastive learning and multirate processing within a pretrained language model.
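To make the idea of contrastive alignment concrete, the following is a minimal sketch of a generic InfoNCE-style loss between paired audio and text embeddings. This page does not state CALM's actual objective, so the loss form, the audio-to-text direction, the `temperature` value, and the function names `cosine` and `info_nce` are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(audio, text, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the paper's stated loss).

    audio[i] and text[i] are treated as a positive pair; every other
    text[j] in the batch acts as a negative for audio[i].
    """
    losses = []
    for i, a in enumerate(audio):
        logits = [cosine(a, t) / temperature for t in text]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy embeddings: aligned pairs should score a lower loss than mismatched ones.
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
shuffled_text = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = info_nce(audio, aligned_text)
loss_shuffled = info_nce(audio, shuffled_text)
```

In this toy setup the loss is small when each audio vector points the same way as its paired text vector and large when the pairs are swapped, which is the behaviour a contrastive alignment objective rewards.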

Findings

01. Achieves 10-25% improvement in emotion recognition accuracy.

02. Aligns audio and language embeddings effectively with minimal training time.

03. Demonstrates benefits of multirate pretraining for multimodal tasks.

Abstract

Deriving multimodal representations of audio and lexical inputs is a central problem in Natural Language Understanding (NLU). In this paper, we present Contrastive Aligned Audio-Language Multirate and Multimodal Representations (CALM), an approach for learning multimodal representations using contrastive and multirate information inherent in audio and lexical inputs. The proposed model aligns acoustic and lexical information in the input embedding space of a pretrained language-only contextual embedding model. By aligning audio representations to pretrained language representations and utilizing contrastive information between acoustic inputs, CALM is able to bootstrap audio embedding competitive with existing audio representation models in only a few hours of training time. Operationally, audio spectrograms are processed using linearized patches through a Spectral Transformer…
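The abstract mentions that spectrograms are processed as linearized patches. A minimal sketch of that step, assuming non-overlapping rectangular patches flattened in row-major order, could look like the following. The patch sizes, the dropping of incomplete edge patches, and the function name `linearize_patches` are assumptions of this example; the truncated abstract does not specify them.

```python
def linearize_patches(spec, patch_f, patch_t):
    """Split a (frequency x time) spectrogram into flat patch vectors.

    spec is a list of rows (frequency bins), each a list of time frames.
    Trailing bins or frames that do not fill a whole patch are dropped
    (an assumption of this sketch).
    """
    rows = len(spec) // patch_f
    cols = len(spec[0]) // patch_t
    tokens = []
    for r in range(rows):
        for c in range(cols):
            # Flatten one patch_f x patch_t patch in row-major order.
            patch = [
                spec[r * patch_f + i][c * patch_t + j]
                for i in range(patch_f)
                for j in range(patch_t)
            ]
            tokens.append(patch)
    return tokens

# Toy spectrogram: 8 frequency bins x 12 time frames, values 0..95.
spec = [[f * 12 + t for t in range(12)] for f in range(8)]
tokens = linearize_patches(spec, patch_f=4, patch_t=4)
```

Each resulting vector plays the role of one input token; in a transformer setting it would typically pass through a learned linear projection before the attention layers.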

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Absolute Position Encodings · Softmax · Byte Pair Encoding · Layer Normalization · Dropout · Label Smoothing