VALLR-Pin: Uncertainty-Factorized Visual Speech Recognition for Mandarin with Pinyin Guidance
Chang Sun, Dongliang Xie, Wanpeng Xie, Bo Qin, Hong Yang

TL;DR
VALLR-Pin introduces a two-stage Mandarin visual speech recognition framework that leverages Pinyin as an intermediate representation and uses LLM-based refinement to improve accuracy amid homophones and viseme ambiguity.
Contribution
The paper presents a VSR architecture that explicitly incorporates Pinyin as an intermediate representation and applies LLM-based refinement, with the aim of improving Mandarin visual speech recognition accuracy.
Findings
Improved transcription accuracy on public benchmarks.
Effective handling of homophones and viseme ambiguity.
Robust performance in multi-speaker scenarios.
Abstract
Visual speech recognition (VSR) aims to transcribe spoken content from silent lip-motion videos and is particularly challenging in Mandarin due to severe viseme ambiguity and pervasive homophones. We propose VALLR-Pin, a two-stage Mandarin VSR framework that extends the VALLR architecture by explicitly incorporating Pinyin as an intermediate representation. In the first stage, a shared visual encoder feeds dual decoders that jointly predict Mandarin characters and their corresponding Pinyin sequences, encouraging more robust visual-linguistic representations. In the second stage, an LLM-based refinement module takes the predicted Pinyin sequence together with an N-best list of character hypotheses to resolve homophone-induced ambiguities. To further adapt the LLM to visual recognition errors, we fine-tune it on synthetic instruction data constructed from model-generated Pinyin-text…
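To make the second stage concrete, the sketch below shows one way the predicted Pinyin sequence and an N-best list of character hypotheses could be packaged for an LLM. It is a minimal illustration, not the authors' implementation: the function names, the toy character-to-Pinyin table, and the prompt wording are all assumptions introduced here. It also shows why Pinyin alone cannot settle the choice between homophones: two hypotheses can match the Pinyin perfectly and still differ in characters, which is the ambiguity the refinement step is meant to resolve.

```python
from typing import Dict, List

# Toy character-to-Pinyin table (hypothetical; a real system would use a full lexicon).
TOY_LEXICON: Dict[str, str] = {
    "他": "ta1", "她": "ta1",
    "在": "zai4", "再": "zai4",
    "做": "zuo4", "坐": "zuo4",
    "饭": "fan4",
}


def pinyin_agreement(hypothesis: str, pinyin: List[str]) -> float:
    """Fraction of positions where a character's toy reading matches the predicted syllable."""
    if not pinyin or len(hypothesis) != len(pinyin):
        return 0.0
    hits = sum(TOY_LEXICON.get(ch) == syl for ch, syl in zip(hypothesis, pinyin))
    return hits / len(pinyin)


def build_refinement_prompt(pinyin: List[str], nbest: List[str]) -> str:
    """Assemble an instruction-style prompt from the Pinyin sequence and the N-best list."""
    lines = ["Predicted Pinyin: " + " ".join(pinyin), "Candidate transcriptions:"]
    for rank, hyp in enumerate(nbest, start=1):
        lines.append(f"{rank}. {hyp}")
    lines.append("Return the most plausible Mandarin sentence consistent with the Pinyin.")
    return "\n".join(lines)


if __name__ == "__main__":
    pinyin = ["ta1", "zai4", "zuo4", "fan4"]
    nbest = ["他再坐饭", "她在做饭", "他在做反"]
    for hyp in nbest:
        # The first two hypotheses are homophonous, so both agree fully with the Pinyin.
        print(hyp, pinyin_agreement(hyp, pinyin))
    print(build_refinement_prompt(pinyin, nbest))
```

In this toy case the first two candidates are indistinguishable by Pinyin agreement, so the final choice has to come from linguistic context, which is the role the abstract assigns to the LLM.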
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Phonetics and Phonology Research
