Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering
Xu Wang, Shengeng Tang, Fei Wang, Lechao Cheng, Dan Guo, Feng Xue, Richang Hong

TL;DR
Text2Lip introduces a viseme-guided framework for generating realistic talking face videos from text, overcoming audio dependence and ambiguity issues through structured viseme sequences and progressive learning.
Contribution
The paper presents a novel viseme-centric approach with a curriculum learning strategy for robust, controllable lip-synced face generation from text, independent of high-quality audio data.
Findings
Outperforms existing methods in semantic fidelity and realism.
Effective in both audio-present and audio-free scenarios.
Achieves accurate lip synchronization with photorealistic rendering.
Abstract
Generating semantically coherent and visually accurate talking faces requires bridging the gap between linguistic meaning and facial articulation. Although audio-driven methods remain prevalent, their reliance on high-quality paired audio-visual data and the inherent ambiguity in mapping acoustics to lip motion pose significant challenges in terms of scalability and robustness. To address these issues, we propose Text2Lip, a viseme-centric framework that constructs an interpretable phonetic-visual bridge by embedding textual input into structured viseme sequences. These mid-level units serve as a linguistically grounded prior for lip motion prediction. Furthermore, we design a progressive viseme-audio replacement strategy based on curriculum learning, enabling the model to gradually transition from real audio to pseudo-audio reconstructed from enhanced viseme features via cross-modal…
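The two ideas in the abstract, a many-to-one phoneme-to-viseme mapping and a curriculum that gradually swaps real audio for pseudo-audio, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the viseme table, the class labels, the linear schedule, and the function names are hypothetical and are not taken from the paper, which may use a different inventory and a different replacement schedule.

```python
import random

# Hypothetical, heavily simplified phoneme-to-viseme table (ARPAbet-style keys).
# Visemes are many-to-one: several phonemes share one visible mouth shape.
PHONEME_TO_VISEME = {
    "P": "bilabial", "B": "bilabial", "M": "bilabial",
    "F": "labiodental", "V": "labiodental",
    "T": "alveolar", "D": "alveolar",
    "AA": "open", "AE": "open",
    "IY": "spread", "IH": "spread",
    "UW": "rounded", "OW": "rounded",
}


def phonemes_to_visemes(phonemes):
    """Map a phoneme sequence to viseme classes; unknown phonemes become 'neutral'."""
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]


def replacement_prob(step, total_steps):
    """Assumed linear curriculum: probability of using pseudo-audio at this step."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(1.0, max(0.0, step / total_steps))


def pick_audio_source(step, total_steps, rng):
    """Choose which audio feature stream drives this training step."""
    use_pseudo = rng.random() < replacement_prob(step, total_steps)
    return "pseudo" if use_pseudo else "real"


if __name__ == "__main__":
    rng = random.Random(0)
    print(phonemes_to_visemes(["B", "AE", "D"]))
    print([pick_audio_source(s, 100, rng) for s in (0, 50, 100)])
```

In this sketch, training starts with real audio only and ends with pseudo-audio only, so a model trained this way could in principle run from text alone at inference time. The linear ramp is the simplest possible choice and is used here purely as a placeholder.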
Peer Reviews
Decision: Submitted to ICLR 2026
- Novel Training Strategy (PVAR): Progressively replacing real audio with viseme-derived pseudo-audio enables generation from text alone.
- SOTA results: The model reports visual quality (SSIM, FID) and lip-sync (Sync-C) scores that are comparable or superior to SOTA audio-driven models.
- The renderer is too central in metrics: The paper claims SOTA visual quality (SSIM, PSNR, FID) but uses a SOTA renderer (EchoMimic) as its final stage. This is a major confounding variable. These metrics are evaluating EchoMimic's rendering power, not just Text2Lip's landmark generation. The model's actual contribution (lip-sync/landmark quality) is not SOTA (Table 2).
- Central Motivation is Unproven: The entire paper is based on solving "audio-to-lip ambiguity" (e.g., "bad boy" vs. "bat boa
1. Speech-driven models often learn ambiguous mappings from audio to lip shapes. To address this issue, the paper explicitly models the linguistic-phonetic-visual hierarchy instead of relying solely on audio, which provides semantically grounded priors for facial motion synthesis.
2. Text2Lip surpasses other SOTA methods in visual quality and semantic quality.
3. Text2Lip introduces a curriculum-based viseme-audio replacement strategy that facilitates flexible modality handling, supporting both audio-d
1. Although using visemes instead of speech can resolve the ambiguity of the mapping, it may also affect synchronization and rhythmic cadence (the Sync-C and Sync-D results in Table 1 do not appear to be the best).
2. If the speech itself carries emotion, and that emotion is complex and changing, will using text alone as input degrade performance? Is it possible to test this on an emotional dataset?
The paper presents a clear problem formulation and addresses an underexplored direction of generating lip motion directly from text. The overall framework is conceptually well structured and the pipeline is easy to follow. The integration of viseme-level modeling provides an interesting intermediate representation between linguistic and visual domains.
The proposed multi-stage design (text to viseme to pseudo-audio to landmark to renderer) appears unnecessarily complex and may accumulate errors across stages without clear justification or analysis of each component’s necessity. Given the maturity of current text-to-speech (TTS) systems and high-performing audio-driven video generators, a natural question arises: why not decompose the task into a more straightforward two-stage pipeline, which is text-to-speech followed by speech-driven talking
Taxonomy
Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis
