Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing

Jeong Hun Yeo; Seunghee Han; Minsu Kim; Yong Man Ro

arXiv:2402.15151 · cs.CV · May 15, 2024 · 2 citations

TL;DR

This paper introduces VSP-LLM, a framework that enhances visual speech processing by integrating large language models for improved context understanding, multi-task performance, and efficient training with limited data.

Contribution

The paper presents a novel VSP-LLM framework that combines visual speech recognition and translation with LLMs, employing deduplication and LoRA for efficient training.

Findings

01

VSP-LLM outperforms recent models in lip translation tasks.

02

VSP-LLM remains effective when trained with only 30 hours of labeled data.

03

Reduces computational cost through deduplication and LoRA.
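
To give a rough sense of why LoRA lowers training cost, the arithmetic below compares the parameters updated by full fine-tuning of one weight matrix against a rank-r LoRA adapter, which trains two low-rank factors of shapes (d_out x r) and (r x d_in) instead. The layer size and rank used here are illustrative assumptions, not values reported on this page, and the function names are hypothetical.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole d_out x d_in weight matrix is trained."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters updated by a rank-`rank` LoRA adapter on the same matrix.

    LoRA freezes the original weight and trains B (d_out x rank) and
    A (rank x d_in), so the trainable count is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


if __name__ == "__main__":
    # Assumed sizes for illustration only (not taken from the paper).
    d_in, d_out, rank = 4096, 4096, 16
    full = full_finetune_params(d_in, d_out)
    lora = lora_trainable_params(d_in, d_out, rank)
    print(f"full fine-tuning: {full} parameters")
    print(f"LoRA (rank {rank}): {lora} parameters")
    print(f"fraction trained: {lora / full:.4%}")
```

Under these assumed sizes the adapter trains well under one percent of the matrix's parameters, which is the general mechanism behind the reduced cost; the actual savings depend on which layers are adapted and on the rank chosen.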

Abstract

In visual speech processing, context modeling capability is one of the most important requirements due to the ambiguous nature of lip movements. For example, homophenes, words that share identical lip movements but produce different sounds, can be distinguished by considering the context. In this paper, we propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM), to maximize the context modeling ability by bringing the overwhelming power of LLMs. Specifically, VSP-LLM is designed to perform multi-tasks of visual speech recognition and translation, where the given instructions control the type of task. The input video is mapped to the input latent space of an LLM by employing a self-supervised visual speech model. Focused on the fact that there is redundant information in input frames, we propose a novel deduplication method that reduces the embedded…
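
The abstract is cut off before it explains how deduplication works, so the following is only an illustrative sketch. It assumes that every frame embedding comes with a discrete unit label and that consecutive frames sharing a label are merged by averaging their embeddings. Both points are assumptions rather than details confirmed by the text above, and the function name `deduplicate` is hypothetical.

```python
from itertools import groupby
from typing import List, Sequence


def deduplicate(features: Sequence[Sequence[float]],
                units: Sequence[int]) -> List[List[float]]:
    """Merge runs of consecutive frames that share a unit label.

    Assumption: each run is replaced by the element-wise mean of its
    frame embeddings, which shortens the sequence passed downstream.
    """
    if len(features) != len(units):
        raise ValueError("features and units must have the same length")

    merged: List[List[float]] = []
    start = 0
    for _, run in groupby(units):
        run_len = len(list(run))
        chunk = features[start:start + run_len]
        dim = len(chunk[0])
        # Element-wise average over the frames in this run.
        merged.append([sum(frame[d] for frame in chunk) / run_len
                       for d in range(dim)])
        start += run_len
    return merged


if __name__ == "__main__":
    # Four frames with 2-dimensional embeddings; the first two share unit 7.
    frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    labels = [7, 7, 3, 7]
    print(deduplicate(frames, labels))
```

Only adjacent frames are merged in this sketch: the final frame keeps its own entry even though its label matches the first run, so temporal order is preserved while the sequence shrinks from four embeddings to three.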

Code & Models

Repositories

sally-sh/vsp-llm
PyTorch (official)

Videos

Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing

Taxonomy

Topics: Speech and dialogue systems