Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing
Jeong Hun Yeo, Seunghee Han, Minsu Kim, Yong Man Ro

TL;DR
This paper introduces VSP-LLM, a framework that enhances visual speech processing by integrating large language models for improved context understanding, multi-task performance, and efficient training with limited data.
Contribution
The paper presents a novel VSP-LLM framework that combines visual speech recognition and translation with LLMs, employing deduplication and LoRA for efficient training.
Findings
VSP-LLM outperforms recent models in lip translation tasks.
Remains effective when trained on only 30 hours of labeled data.
Reduces computational cost through deduplication and LoRA.
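As a rough illustration of why LoRA lowers training cost, the sketch below applies the standard low-rank update, y = Wx + (alpha / r) · B(Ax), to a frozen weight matrix and compares trainable parameter counts. The function names, the 4096-dimensional layer, and the rank of 16 are hypothetical choices for illustration; they are not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: Vector, alpha: float) -> Vector:
    """Frozen base projection plus a scaled low-rank correction.

    w: (d_out x d_in) frozen weights
    a: (r x d_in) trainable down-projection
    b: (d_out x r) trainable up-projection
    """
    rank = len(a)
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    return [y + (alpha / rank) * d for y, d in zip(base, delta)]


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in A and B combined."""
    return rank * (d_in + d_out)


# Toy example: identity base weights, rank-1 adapter.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]
x = [1.0, 2.0]

# With B initialised to zero, the adapter leaves the base output unchanged.
y_init = lora_forward(w, a, [[0.0], [0.0]], x, alpha=2.0)

# After B has moved away from zero, the output shifts by (alpha / r) * B(Ax).
y_tuned = lora_forward(w, a, [[0.5], [0.0]], x, alpha=2.0)

# Hypothetical layer size: 4096 x 4096 projection with rank 16.
full_params = 4096 * 4096
adapter_params = lora_param_count(4096, 4096, 16)
```

For the hypothetical layer above, the adapter trains 128 times fewer parameters than full fine-tuning of the same projection would.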
Abstract
In visual speech processing, context modeling capability is one of the most important requirements due to the ambiguous nature of lip movements. For example, homophenes, words that share identical lip movements but produce different sounds, can be distinguished by considering the context. In this paper, we propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM), to maximize the context modeling ability by bringing the overwhelming power of LLMs. Specifically, VSP-LLM is designed to perform multi-tasks of visual speech recognition and translation, where the given instructions control the type of task. The input video is mapped to the input latent space of an LLM by employing a self-supervised visual speech model. Focused on the fact that there is redundant information in input frames, we propose a novel deduplication method that reduces the embedded…
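The abstract is cut off before it describes how deduplication works, so the following is only a minimal sketch of one plausible form: it assumes every frame carries a discrete cluster label (here called a unit id) and averages the features of consecutive frames that share a label. The `deduplicate` function, the labels, and the toy features are all hypothetical.

```python
from itertools import groupby
from typing import List, Sequence


def deduplicate(
    features: Sequence[Sequence[float]], unit_ids: Sequence[int]
) -> List[List[float]]:
    """Average runs of consecutive frames that share the same unit id.

    Assumption: redundancy is detected through per-frame discrete labels.
    """
    if len(features) != len(unit_ids):
        raise ValueError("features and unit_ids must have the same length")

    merged: List[List[float]] = []
    start = 0
    for _, run in groupby(unit_ids):
        run_len = len(list(run))
        chunk = features[start:start + run_len]
        dim = len(chunk[0])
        # Mean over the run, computed per feature dimension.
        merged.append(
            [sum(frame[d] for frame in chunk) / run_len for d in range(dim)]
        )
        start += run_len
    return merged


# Six frames with 2-dimensional features and hypothetical unit ids.
frames = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0], [0.0, 4.0], [2.0, 6.0], [4.0, 8.0]]
units = [7, 7, 2, 9, 9, 9]

reduced = deduplicate(frames, units)
```

In this toy case six frames collapse to three embeddings, so a downstream LLM would process half as many input positions.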
Taxonomy
Topics: Speech and dialogue systems
