Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing

Zehua Liu; Xiaolou Li; Li Guo; Lantian Li; Dong Wang

arXiv:2506.02012 · cs.CV · June 4, 2025

Open Access

TL;DR

This paper investigates how to use large language models (LLMs) effectively in visual speech recognition (VSR), showing that scaling the LLM, guiding decoding with contextual text, and iteratively polishing the output each improve recognition accuracy.

Contribution

The paper systematically explores how LLMs can be used in VSR: it confirms a scaling law relating LLM size to VSR performance, and it proposes context-aware decoding and iterative polishing to improve accuracy further.

Findings

01 LLM size positively correlates with VSR accuracy

02 Context-aware decoding improves recognition results

03 Iterative polishing reduces recognition errors

Abstract

Visual Speech Recognition (VSR) transcribes speech by analyzing lip movements. Recently, Large Language Models (LLMs) have been integrated into VSR systems, leading to notable performance improvements. However, the potential of LLMs has not been extensively studied, and how to effectively utilize LLMs in VSR tasks remains unexplored. This paper systematically explores how to better leverage LLMs for VSR tasks and provides three key contributions: (1) Scaling Test: We study how the LLM size affects VSR performance, confirming a scaling law in the VSR task. (2) Context-Aware Decoding: We add contextual text to guide the LLM decoding, improving recognition accuracy. (3) Iterative Polishing: We propose iteratively refining LLM outputs, progressively reducing recognition errors. Extensive experiments demonstrate that by these designs, the great potential of LLMs can be largely harnessed,…
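To make contributions (2) and (3) concrete, here is a minimal Python sketch of how contextual text and a previous hypothesis might be fed back into an LLM-based decoder. The abstract does not specify the prompt format, the feedback mechanism, or the stopping rule, so everything here is an assumption for illustration: the `decode` callable, the prompt wording, `build_prompt`, `polish`, and the "stop when the output no longer changes" criterion are all hypothetical and are not the authors' implementation.

```python
from typing import Callable, List, Optional

# Hypothetical decoder interface (an assumption, not the paper's API):
# takes visual features plus a text prompt and returns a transcript.
Decoder = Callable[[object, str], str]


def build_prompt(context: Optional[str], previous: Optional[str]) -> str:
    """Assemble the text prompt that guides decoding.

    `context` stands in for the contextual text of context-aware decoding;
    `previous` is the last hypothesis, used for iterative polishing.
    The wording of the prompt is illustrative only.
    """
    parts = []
    if context:
        parts.append(f"Context: {context}")
    if previous:
        parts.append(f"Previous hypothesis: {previous}")
    parts.append("Transcribe the lip movements.")
    return "\n".join(parts)


def polish(
    decode: Decoder,
    features: object,
    context: Optional[str] = None,
    max_rounds: int = 3,
) -> List[str]:
    """Decode repeatedly, feeding each hypothesis into the next round.

    Stops early once a round reproduces the previous hypothesis
    (an assumed stopping rule). Returns every hypothesis in order.
    """
    hypotheses: List[str] = []
    previous: Optional[str] = None
    for _ in range(max_rounds):
        hypothesis = decode(features, build_prompt(context, previous))
        hypotheses.append(hypothesis)
        if hypothesis == previous:
            break  # output has stabilised; further rounds would repeat it
        previous = hypothesis
    return hypotheses
```

In use, `decode` would wrap whatever VSR model is available; the last element of the returned list is the final transcript, and the earlier elements show how the hypothesis changed from round to round.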

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Multimodal Machine Learning Applications