Cross-stitched Multi-modal Encoders

Karan Singla; Daniel Pressel; Ryan Price; Bhargav Srinivas Chinnari; Yeon-Jun Kim; Srinivas Bangalore

arXiv:2204.09227·cs.CL·April 21, 2022

TL;DR

This paper introduces a compact, resource-efficient multi-modal encoder that fuses speech and text inputs through cross-modal attention. The encoder supports both token-level classification and utterance-level prediction.

Contribution

The paper presents a cross-stitched multi-modal encoder architecture that combines pretrained speech and text encoders using multi-headed cross-modal attention, fine-tunes them jointly on the target task, and can be trained on a single consumer GPU.

Findings

01. Multi-headed attention fusion outperforms simple concatenation of pre-pooled, modality-specific representations.

02. The architecture captures both acoustic-prosodic and lexical information.

03. The model is compact and resource-efficient, and it can be trained on a single GPU.

Abstract

In this paper, we propose a novel architecture for multi-modal speech and text input. We combine pretrained speech and text encoders using multi-headed cross-modal attention and jointly fine-tune on the target problem. The resultant architecture can be used for continuous token-level classification or utterance-level prediction acting on simultaneous text and speech. The resultant encoder efficiently captures both acoustic-prosodic and lexical information. We compare the benefits of multi-headed attention-based fusion for multi-modal utterance-level classification against a simple concatenation of pre-pooled, modality-specific representations. Our model architecture is compact, resource efficient, and can be trained on a single consumer GPU card.
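
The two fusion strategies the abstract contrasts can be illustrated with a small NumPy sketch. This is not the authors' implementation: the attention direction (text tokens as queries over speech frames), the shared hidden size, the residual connection, and the random matrices standing in for learned weights are all assumptions made for illustration. The function names `cross_modal_attention` and `concat_pooled` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text, speech, w_q, w_k, w_v, w_o, num_heads):
    """Multi-headed attention where text tokens (queries) attend over
    speech frames (keys and values). The direction is an assumption."""
    t_len, d_model = text.shape
    s_len = speech.shape[0]
    d_head = d_model // num_heads
    # Project, then split into heads: (num_heads, length, d_head).
    q = (text @ w_q).reshape(t_len, num_heads, d_head).transpose(1, 0, 2)
    k = (speech @ w_k).reshape(s_len, num_heads, d_head).transpose(1, 0, 2)
    v = (speech @ w_v).reshape(s_len, num_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product scores: (num_heads, t_len, s_len).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    context = weights @ v  # (num_heads, t_len, d_head)
    # Merge heads back to (t_len, d_model) and apply the output projection.
    merged = context.transpose(1, 0, 2).reshape(t_len, d_model)
    return merged @ w_o, weights

def concat_pooled(text, speech):
    """Baseline: mean-pool each modality separately, then concatenate."""
    return np.concatenate([text.mean(axis=0), speech.mean(axis=0)])

# Toy inputs standing in for pretrained encoder outputs (assumed shapes).
rng = np.random.default_rng(0)
d_model, num_heads = 16, 4
text = rng.normal(size=(5, d_model))     # 5 text tokens
speech = rng.normal(size=(40, d_model))  # 40 speech frames
w_q, w_k, w_v, w_o = (
    rng.normal(scale=d_model ** -0.5, size=(d_model, d_model)) for _ in range(4)
)

attended, weights = cross_modal_attention(
    text, speech, w_q, w_k, w_v, w_o, num_heads
)
token_level = text + attended               # one fused vector per token (residual is an assumption)
utterance_level = token_level.mean(axis=0)  # pooled vector for utterance-level prediction
baseline = concat_pooled(text, speech)      # concatenation baseline

print(token_level.shape, utterance_level.shape, baseline.shape)
```

In this sketch the attention path keeps one fused vector per text token, so it can feed either a token-level classifier or, after pooling, an utterance-level one. The concatenation baseline pools each modality before fusing, so it yields only an utterance-level vector.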

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and Dialogue Systems