Efficient Audiovisual Speech Processing via MUTUD: Multimodal Training and Unimodal Deployment

Joanna Hong; Sanjeel Parekh; Honglie Chen; Jacob Donley; Ke Tan; Buye Xu; Anurag Kumar

arXiv:2501.18157 · cs.SD · January 31, 2025

TL;DR

This paper introduces MUTUD, a framework that trains audiovisual speech models with multiple modalities but deploys them with fewer, reducing computational cost while keeping performance close to that of the multimodal model.

Contribution

The paper presents the MUTUD framework, whose Temporally Aligned Modality feature Estimation (TAME) module estimates information from the missing modality using the modalities that are present, so that a model trained multimodally performs better when deployed unimodally.

Findings

01. MUTUD reduces the performance gap between multimodal and unimodal models.

02. It reduces model size and compute by up to 80%.

03. It is effective across various audiovisual speech tasks.

Abstract

Building reliable speech systems often requires combining multiple modalities, like audio and visual cues. While such multimodal solutions frequently lead to improvements in performance and may even be critical in certain cases, they come with several constraints such as increased sensory requirements, computational cost, and modality synchronization, to mention a few. These challenges constrain the direct use of these multimodal solutions in real-world applications. In this work, we develop approaches where the learning happens with all available modalities but the deployment or inference is done with just one or a reduced set of modalities. To do so, we propose a Multimodal Training and Unimodal Deployment (MUTUD) framework which includes a Temporally Aligned Modality feature Estimation (TAME) module that can estimate information from the missing modality using the modalities present during…
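To make the train-with-both, deploy-with-one idea concrete, here is a minimal sketch in Python. It is not the paper's TAME module: it swaps in a plain ridge-regression map from audio features to frame-aligned video features, and every name (`fit_feature_estimator`, `unimodal_forward`), dimension, and the synthetic data are assumptions made purely for illustration.

```python
import numpy as np

def fit_feature_estimator(audio_feats, video_feats, ridge=1e-3):
    """Training time: both modalities are available.

    Fits a linear map (ridge regression, a stand-in for a learned module)
    from audio features to the video features of the same time frames.
    """
    d = audio_feats.shape[1]
    gram = audio_feats.T @ audio_feats + ridge * np.eye(d)
    return np.linalg.solve(gram, audio_feats.T @ video_feats)

def unimodal_forward(audio_feats, weights):
    """Deployment time: only audio is available.

    Estimates the missing video features frame by frame and concatenates
    them with the audio features, as a downstream fusion step might expect.
    """
    est_video = audio_feats @ weights
    return np.concatenate([audio_feats, est_video], axis=1)

# Synthetic, frame-aligned features (hypothetical sizes).
rng = np.random.default_rng(0)
n_frames, d_audio, d_video = 200, 8, 4
true_map = rng.normal(size=(d_audio, d_video))
train_audio = rng.normal(size=(n_frames, d_audio))
train_video = train_audio @ true_map + 0.1 * rng.normal(size=(n_frames, d_video))

# Multimodal training: learn to predict video features from audio.
weights = fit_feature_estimator(train_audio, train_video)

# Unimodal deployment: no video input is used below.
test_audio = rng.normal(size=(50, d_audio))
test_video = test_audio @ true_map  # held out, used only for evaluation
fused = unimodal_forward(test_audio, weights)

est_error = np.mean((fused[:, d_audio:] - test_video) ** 2)
zero_error = np.mean(test_video ** 2)  # baseline: treat the missing modality as zeros
print(fused.shape, est_error < zero_error)
```

In this toy setting the video encoder and camera input are not needed at inference, which is the kind of saving the framework targets. A real system would replace the linear map with a learned network and the synthetic arrays with encoder outputs.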

Taxonomy

Topics: Subtitles and Audiovisual Media