Video Text Preservation with Synthetic Text-Rich Videos
Ziyang Liu, Kevin Valencia, Justin Cui

TL;DR
This paper presents a lightweight synthetic-supervision method that improves the legibility and temporal consistency of text rendered by text-to-video (T2V) models. The pre-trained model is fine-tuned on synthetic text-rich videos that are animated from synthetic text-rich images.
Contribution
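It generates text-rich images with a text-to-image (T2I) diffusion model, animates them into short videos with a text-agnostic image-to-video (I2V) model, and uses the resulting video-prompt pairs to fine-tune Wan2.1 without architectural changes.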
Findings
Improved short-text legibility in generated videos.
Improved temporal consistency, with emerging structural priors for longer text.
Synthetic supervision improves textual fidelity without changes to the model architecture.
Abstract
While text-to-video (T2V) models have advanced rapidly, they continue to struggle with generating legible and coherent text within videos. In particular, existing models often fail to render even short phrases or words correctly, and previous attempts to address this problem are computationally expensive and not suitable for video generation. In this work, we investigate a lightweight approach to improve T2V diffusion models using synthetic supervision. We first generate text-rich images using a text-to-image (T2I) diffusion model, then animate them into short videos using a text-agnostic image-to-video (I2V) model. These synthetic video-prompt pairs are used to fine-tune Wan2.1, a pre-trained T2V model, without any architectural changes. Our results show improvement in short-text legibility and temporal consistency, with emerging structural priors for longer text. These findings suggest…
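The data-construction step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `make_prompt`, `build_synthetic_pairs`, the prompt template, and the stand-in `toy_t2i` / `toy_i2v` callables are all hypothetical names introduced here. In practice the two callables would wrap a real T2I diffusion model and a text-agnostic I2V model, and the resulting pairs would feed a Wan2.1 fine-tuning run, which is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical stand-ins for real image/video tensors.
Image = List[List[int]]
Video = List[Image]


@dataclass
class VideoPromptPair:
    """One synthetic training example: a prompt and the video made from it."""
    prompt: str
    video: Video


def make_prompt(text: str) -> str:
    # Assumed prompt template; the source does not specify the wording used.
    return f'A sign that reads "{text}"'


def build_synthetic_pairs(
    texts: Sequence[str],
    t2i: Callable[[str], Image],
    i2v: Callable[[Image, int], Video],
    num_frames: int = 8,
) -> List[VideoPromptPair]:
    """Generate a text-rich image per prompt, then animate it into a short video."""
    pairs = []
    for text in texts:
        prompt = make_prompt(text)
        image = t2i(prompt)             # step 1: text-rich image from the T2I model
        video = i2v(image, num_frames)  # step 2: I2V model animates it, never sees the text
        pairs.append(VideoPromptPair(prompt=prompt, video=video))
    return pairs


def toy_t2i(prompt: str) -> Image:
    # Placeholder "image": a 1x1 grid encoding the prompt length.
    return [[len(prompt)]]


def toy_i2v(image: Image, num_frames: int) -> Video:
    # Placeholder "animation": repeat a copy of the image for each frame.
    return [[row[:] for row in image] for _ in range(num_frames)]


pairs = build_synthetic_pairs(["OPEN", "EXIT"], toy_t2i, toy_i2v, num_frames=4)
```

Passing the two models in as callables keeps the pairing logic independent of any particular T2I or I2V library, which matches the abstract's point that the I2V stage is text-agnostic: it receives only the image, and the prompt is carried through solely as the caption for fine-tuning.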
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Multimodal Machine Learning Applications
