TL;DR
This paper introduces an Audio-Visual Language Model (AVLM) that integrates full-face visual cues into a pre-trained expressive speech model, yielding substantial gains over speech-only baselines on emotion recognition (e.g., +5 F1) and expressive dialogue tasks.
Contribution
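It presents a multimodal framework that adds full-face visual cues to a pre-trained expressive speech model, compares multiple visual encoders and fusion strategies during pre-training, and shows that fine-tuning the resulting model improves emotion recognition and expressive dialogue over speech-only baselines.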
Findings
+5 F1 in emotion recognition over speech-only baselines
Enhanced speech expressiveness with visual cues
Foundation for multimodal conversational systems
Abstract
We present an Audio-Visual Language Model (AVLM) for expressive speech generation by integrating full-face visual cues into a pre-trained expressive speech model. We explore multiple visual encoders and multimodal fusion strategies during pre-training to identify the most effective integration approach. Subsequent fine-tuning on emotion recognition and expressive dialogue tasks yields substantial gains over speech-only baselines (e.g., +5 F1 in emotion recognition). AVLM highlights the value of expressive visual information in guiding speech generation and offers a foundation for end-to-end multimodal conversational systems.