GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models

Bozhou Li; Sihan Yang; Yushuo Guan; Ruichuan An; Xinlong Chen; Yang Shi; Pengfei Wan; Wentao Zhang; Yuanxing Zhang

arXiv:2512.15560 · cs.CV · December 30, 2025

TL;DR

GRAN-TED introduces a new benchmark and training paradigm for creating robust, aligned, and nuanced text embeddings that significantly improve the performance and efficiency of diffusion models in text-to-image and text-to-video generation.

Contribution

The paper presents TED-6K, a fast and reliable evaluation benchmark, and a novel two-stage training method for superior text encoders tailored for diffusion models.

Findings

01 TED-6K correlates strongly with downstream task performance.

02 Evaluating with TED-6K is approximately 750 times faster than end-to-end training.

03 GRAN-TED achieves state-of-the-art results on TED-6K and improves generation quality.
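
To make the first finding concrete, here is a minimal sketch of how agreement between a benchmark score and a downstream metric could be checked across several encoders. The summary above does not say which correlation statistic the authors use, so Spearman rank correlation is an assumption here. The function names and the scores in the usage lines are hypothetical placeholders, not results from the paper.

```python
from typing import Sequence


def ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical placeholder scores for four text encoders (NOT from the paper).
benchmark_scores = [0.52, 0.61, 0.58, 0.70]
downstream_scores = [0.40, 0.55, 0.47, 0.66]
rho = spearman(benchmark_scores, downstream_scores)
```

A value of `rho` near 1 would mean the benchmark orders the encoders the same way the downstream metric does, which is the property that lets a cheap text-only evaluation stand in for end-to-end training.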

Abstract

The text encoder is a critical component of text-to-image and text-to-video diffusion models, fundamentally determining the semantic fidelity of the generated content. However, its development has been hindered by two major challenges: the lack of an efficient evaluation framework that reliably predicts downstream generation performance, and the difficulty of effectively adapting pretrained language models for visual synthesis. To address these issues, we introduce GRAN-TED, a paradigm to Generate Robust, Aligned, and Nuanced Text Embeddings for Diffusion models. Our contribution is twofold. First, we propose TED-6K, a novel text-only benchmark that enables efficient and robust assessment of an encoder's representational quality without requiring costly end-to-end model training. We demonstrate that performance on TED-6K, standardized via a lightweight, unified adapter, strongly…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning