Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering

Xu Wang; Shengeng Tang; Fei Wang; Lechao Cheng; Dan Guo; Feng Xue; Richang Hong

arXiv:2508.02362·cs.CV·August 5, 2025


Open Access 3 Reviews

TL;DR

Text2Lip introduces a viseme-guided framework for generating realistic talking face videos from text, overcoming audio dependence and ambiguity issues through structured viseme sequences and progressive learning.

Contribution

The paper presents a novel viseme-centric approach with a curriculum learning strategy for robust, controllable lip-synced face generation from text, independent of high-quality audio data.
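To make the viseme idea concrete, the sketch below groups phonemes into viseme classes with a small lookup table. It is only an illustration of the general many-to-one phoneme-to-viseme relationship: the table `PHONEME_TO_VISEME`, its class labels, and the helper `phonemes_to_visemes` are hypothetical and are not taken from the paper, whose actual viseme inventory is not given here.

```python
from typing import Dict, List

# Hypothetical toy table (an assumption, not the paper's inventory):
# ARPAbet-style phoneme -> viseme class label. Phonemes that look alike
# on the lips share a class.
PHONEME_TO_VISEME: Dict[str, str] = {
    "P": "bilabial", "B": "bilabial", "M": "bilabial",
    "F": "labiodental", "V": "labiodental",
    "T": "alveolar", "D": "alveolar", "N": "alveolar",
    "AE": "open_front", "AA": "open_back",
}


def phonemes_to_visemes(phonemes: List[str], unknown: str = "neutral") -> List[str]:
    """Map each phoneme to its viseme class; unseen phonemes fall back to `unknown`."""
    return [PHONEME_TO_VISEME.get(p.upper(), unknown) for p in phonemes]


# "bad" and "bat" differ in their final phoneme but share a viseme sequence
# under this toy table.
bad = phonemes_to_visemes(["B", "AE", "D"])
bat = phonemes_to_visemes(["B", "AE", "T"])
print(bad)
print(bad == bat)
```

Because several phonemes collapse into one class, a viseme sequence is a coarser but visually grounded description of the text, which is the kind of mid-level unit the contribution refers to.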

Findings

01

Outperforms existing methods in semantic fidelity and realism.

02

Effective in both audio-present and audio-free scenarios.

03

Achieves accurate lip synchronization with photorealistic rendering.

Abstract

Generating semantically coherent and visually accurate talking faces requires bridging the gap between linguistic meaning and facial articulation. Although audio-driven methods remain prevalent, their reliance on high-quality paired audio-visual data and the inherent ambiguity in mapping acoustics to lip motion pose significant challenges in terms of scalability and robustness. To address these issues, we propose Text2Lip, a viseme-centric framework that constructs an interpretable phonetic-visual bridge by embedding textual input into structured viseme sequences. These mid-level units serve as a linguistically grounded prior for lip motion prediction. Furthermore, we design a progressive viseme-audio replacement strategy based on curriculum learning, enabling the model to gradually transition from real audio to pseudo-audio reconstructed from enhanced viseme features via cross-modal…
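The progressive replacement idea in the abstract can be sketched as a curriculum schedule. The version below is a minimal illustration under stated assumptions: the linear ramp, the warm-up fraction, the per-sample replacement, and the names `replacement_prob` and `mix_batch` are all hypothetical. The abstract does not specify the schedule or the granularity the authors use.

```python
import random
from typing import List, Sequence


def replacement_prob(step: int, total_steps: int, warmup_frac: float = 0.2) -> float:
    """Assumed linear ramp: 0 during warm-up, then rising to 1 at total_steps."""
    warmup = int(total_steps * warmup_frac)
    if step <= warmup:
        return 0.0
    if step >= total_steps:
        return 1.0
    return (step - warmup) / (total_steps - warmup)


def mix_batch(real: Sequence, pseudo: Sequence, p: float, rng: random.Random) -> List:
    """For each sample, use the pseudo-audio feature with probability p, else the real one."""
    return [ps if rng.random() < p else r for r, ps in zip(real, pseudo)]


rng = random.Random(0)
real_feats = ["real_0", "real_1", "real_2", "real_3"]        # placeholder features
pseudo_feats = ["pseudo_0", "pseudo_1", "pseudo_2", "pseudo_3"]

for step in (0, 20, 60, 100):
    p = replacement_prob(step, total_steps=100)
    print(step, p, mix_batch(real_feats, pseudo_feats, p, rng))
```

Early in training every sample keeps its real audio feature; by the final step every sample uses the viseme-derived pseudo-audio, so a model trained this way would no longer need real audio at inference.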

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- Novel training strategy (PVAR): Progressively replacing real audio with viseme-derived pseudo-audio allows the model to generate from text alone.
- SOTA results: The model reports visual-quality (SSIM, FID) and lip-sync (Sync-C) scores that are comparable or superior to those of SOTA audio-driven models.

Weaknesses

- The renderer is too central to the metrics: The paper claims SOTA visual quality (SSIM, PSNR, FID) but uses a SOTA renderer (EchoMimic) as its final stage. This is a major confounding variable. These metrics are evaluating EchoMimic's rendering power, not just Text2Lip's landmark generation. The model's actual contribution (lip-sync/landmark quality) is not SOTA (Table 2).
- Central motivation is unproven: The entire paper is based on solving "audio-to-lip ambiguity" (e.g., "bad boy" vs. "bat boa

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. Speech-driven models often learn ambiguous mappings from audio to lip shapes. To address this issue, the paper explicitly models the linguistic-phonetic-visual hierarchy instead of relying solely on audio, which provides semantically grounded priors for facial motion synthesis.
2. Text2Lip surpasses other SOTA methods in visual quality and semantic quality.
3. Text2Lip introduces a curriculum-based viseme-audio replacement strategy that facilitates flexible modality handling, supporting both audio-d

Weaknesses

1. Although using visemes instead of speech can resolve the ambiguity of the mapping, it may also affect synchronization and rhythmic cadence (the Sync-C and Sync-D results in Table 1 do not appear to be the best).
2. If the speech itself carries emotion, especially complex, changing emotion, will using text alone as input affect performance? Is it possible to test on an emotional dataset?

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The paper presents a clear problem formulation and addresses an underexplored direction of generating lip motion directly from text. The overall framework is conceptually well structured and the pipeline is easy to follow. The integration of viseme-level modeling provides an interesting intermediate representation between linguistic and visual domains.

Weaknesses

The proposed multi-stage design (text to viseme to pseudo-audio to landmark to renderer) appears unnecessarily complex and may accumulate errors across stages without clear justification or analysis of each component’s necessity. Given the maturity of current text-to-speech (TTS) systems and high-performing audio-driven video generators, a natural question arises: why not decompose the task into a more straightforward two-stage pipeline, which is text-to-speech followed by speech-driven talking


Taxonomy

Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis