VALLR-Pin: Uncertainty-Factorized Visual Speech Recognition for Mandarin with Pinyin Guidance

Chang Sun; Dongliang Xie; Wanpeng Xie; Bo Qin; Hong Yang

arXiv:2512.20032 · cs.CV · December 30, 2025

TL;DR

VALLR-Pin introduces a two-stage Mandarin visual speech recognition framework that leverages Pinyin as an intermediate representation and uses LLM-based refinement to improve accuracy amid homophones and viseme ambiguity.

Contribution

The paper presents a VSR architecture that explicitly incorporates Pinyin as an intermediate representation and applies LLM-based refinement to improve Mandarin visual speech recognition accuracy.

Findings

01. Improved transcription accuracy on public benchmarks.

02. Effective handling of homophones and viseme ambiguity.

03. Robust performance in multi-speaker scenarios.

Abstract

Visual speech recognition (VSR) aims to transcribe spoken content from silent lip-motion videos and is particularly challenging in Mandarin due to severe viseme ambiguity and pervasive homophones. We propose VALLR-Pin, a two-stage Mandarin VSR framework that extends the VALLR architecture by explicitly incorporating Pinyin as an intermediate representation. In the first stage, a shared visual encoder feeds dual decoders that jointly predict Mandarin characters and their corresponding Pinyin sequences, encouraging more robust visual-linguistic representations. In the second stage, an LLM-based refinement module takes the predicted Pinyin sequence together with an N-best list of character hypotheses to resolve homophone-induced ambiguities. To further adapt the LLM to visual recognition errors, we fine-tune it on synthetic instruction data constructed from model-generated Pinyin-text…
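
The second stage described above can be illustrated with a minimal sketch of how a predicted Pinyin sequence and an N-best list of character hypotheses might be packed into one LLM input. The function name `build_refinement_prompt`, the prompt wording, the tone-number Pinyin format, and the example sentence are all assumptions made for illustration; the abstract does not specify the actual prompt or data format used by VALLR-Pin.

```python
from typing import List


def build_refinement_prompt(pinyin: List[str], nbest: List[str]) -> str:
    """Assemble a hypothetical refinement prompt (not the paper's actual format).

    pinyin: predicted Pinyin syllables, here written with tone numbers.
    nbest:  character hypotheses, best first.
    """
    if not nbest:
        raise ValueError("need at least one character hypothesis")
    lines = [
        "Correct the Mandarin transcript using the Pinyin as a pronunciation hint.",
        "Pinyin: " + " ".join(pinyin),
        "Candidates:",
    ]
    # Number the hypotheses so the LLM can compare them by rank.
    lines += [f"{rank}. {hyp}" for rank, hyp in enumerate(nbest, start=1)]
    lines.append("Corrected transcript:")
    return "\n".join(lines)


# Illustrative homophones: 他/她 are both "ta1", 在/再 are both "zai4".
prompt = build_refinement_prompt(
    ["ta1", "zai4", "kan4", "shu1"],
    ["他在看书", "她在看书", "他再看书"],
)
print(prompt)
```

In this toy case all three candidates share the same Pinyin, so the Pinyin line alone cannot pick a winner; it only constrains the pronunciation, and the choice among homophones is left to the language model's context knowledge.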

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Phonetics and Phonology Research