Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation

Weiting Tan; Jiachen Lian; Hirofumi Inaguma; Paden Tomasello; Philipp Koehn; Xutai Ma

arXiv:2508.16188 · cs.CL · August 29, 2025

TL;DR

This paper introduces an Audio-Visual Language Model (AVLM) that integrates full-face visual cues into a pre-trained expressive speech model, yielding substantial gains over speech-only baselines in emotion recognition and expressive dialogue.

Contribution

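The paper presents a multimodal framework that integrates full-face visual cues into a pre-trained expressive speech model, compares multiple visual encoders and fusion strategies during pre-training, and reports improved emotion recognition and expressive speech generation after fine-tuning.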

Findings

01  +5 F1 in emotion recognition over speech-only baselines

02  Enhanced speech expressiveness with visual cues

03  Foundation for end-to-end multimodal conversational systems

Abstract

We present an Audio-Visual Language Model (AVLM) for expressive speech generation by integrating full-face visual cues into a pre-trained expressive speech model. We explore multiple visual encoders and multimodal fusion strategies during pre-training to identify the most effective integration approach. Subsequent fine-tuning on emotion recognition and expressive dialogue tasks yields substantial gains over speech-only baselines (e.g., +5 F1 in emotion recognition). AVLM highlights the value of expressive visual information in guiding speech generation and offers a foundation for end-to-end multimodal conversational systems.
