Cascade-Free Mandarin Visual Speech Recognition via Semantic-Guided Cross-Representation Alignment

Lei Yang; Yi He; Fei Wu; Shilin Wang

arXiv:2603.21808·cs.CV·March 24, 2026

TL;DR

This paper introduces a cascade-free model for Mandarin visual speech recognition (VSR) that uses semantic-guided alignment and multitask learning to improve accuracy and efficiency, avoiding the error propagation and added latency of traditional cascade architectures.

Contribution

It proposes a new cascade-free, multitask learning framework with semantic-guided contrastive loss for Mandarin VSR, eliminating error propagation and reducing inference latency.
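As a rough illustration of how a multitask objective with a contrastive alignment term can be assembled, here is a minimal sketch. It is not the authors' formulation: the summary above does not give the loss, so the InfoNCE-style form, the cosine similarity, the temperature, and every name and weight below (`info_nce`, `multitask_loss`, `task_weights`, `align_weight`) are assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE-style loss (assumed form, not taken from the paper).

    anchors[i] is pulled toward positives[i] and pushed away from
    positives[j] for j != i.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def multitask_loss(task_losses, align_loss, task_weights, align_weight):
    """Weighted sum of per-task losses plus the alignment term (hypothetical weights)."""
    weighted_tasks = sum(w * l for w, l in zip(task_weights, task_losses))
    return weighted_tasks + align_weight * align_loss


# Toy embeddings standing in for two representation streams.
stream_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(stream_a, [[1.0, 0.0], [0.0, 1.0]])
misaligned = info_nce(stream_a, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the aligned pairing yields a much smaller loss than the swapped pairing, which is the behavior an alignment term is meant to reward. Because all branches are trained jointly under one objective, no branch needs to consume another branch's decoded output at inference time, which is the general idea behind avoiding a cascade.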

Findings

01. Achieves superior recognition accuracy on public datasets.

02. Effectively balances inference speed and recognition performance.

03. Reduces error accumulation compared to cascade architectures.

Abstract

Chinese Mandarin visual speech recognition (VSR) has advanced in recent years, yet its performance still lags behind that on non-tonal languages such as English. One primary challenge arises from the tonal nature of Mandarin, which limits the effectiveness of conventional sequence-to-sequence modeling approaches. To alleviate this issue, existing Chinese VSR systems commonly incorporate intermediate representations, most notably pinyin, within cascade architectures to enhance recognition accuracy. While this is beneficial, each stage of a cascaded design depends at inference time on the output of the preceding stage, which leads to error accumulation and increased inference latency. To address these limitations, we propose a cascade-free architecture based on multitask learning that jointly integrates multiple intermediate representations, including phoneme and viseme, to…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis