T2I-ConBench: Text-to-Image Benchmark for Continual Post-training

Zhehao Huang; Yuhang Liu; Yixin Lou; Zhengbao He; Mingzhen He; Wenxing Zhou; Tao Li; Kehan Li; Zeyi Huang; Xiaolin Huang

arXiv:2505.16875·cs.CV·May 23, 2025

Open Access·3 Reviews

TL;DR

This paper introduces T2I-ConBench, a comprehensive benchmark for evaluating continual post-training of text-to-image models, addressing challenges like forgetting and generalization across practical scenarios.

Contribution

It provides a standardized evaluation protocol, benchmarks multiple methods, and highlights unresolved issues in continual post-training for text-to-image models.

Findings

01 No method excels across all evaluation dimensions.

02 Joint training does not always outperform continual methods.

03 Cross-task generalization remains a significant challenge.

Abstract

Continual post-training adapts a single text-to-image diffusion model to learn new tasks without incurring the cost of separate models, but naive post-training causes forgetting of pretrained knowledge and undermines zero-shot compositionality. We observe that the absence of a standardized evaluation protocol hampers related research for continual post-training. To address this, we introduce T2I-ConBench, a unified benchmark for continual post-training of text-to-image models. T2I-ConBench focuses on two practical scenarios, item customization and domain enhancement, and analyzes four dimensions: (1) retention of generality, (2) target-task performance, (3) catastrophic forgetting, and (4) cross-task generalization. It combines automated metrics, human-preference modeling, and vision-language QA for comprehensive assessment. We benchmark ten representative methods across three realistic…
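As a rough illustration of how dimensions (2) and (3) can be quantified, the sketch below computes final average target-task performance and average forgetting from a matrix of per-task scores recorded after each training stage. This is the textbook continual-learning formulation, not necessarily the metric T2I-ConBench itself uses; the function names, the score matrix layout, and the example values are all assumptions made for illustration.

```python
from typing import List


def final_average(scores: List[List[float]]) -> float:
    """Mean score over all tasks after the last training stage.

    scores[i][j] is the (hypothetical) score on task j measured after
    training stage i; entries with j > i are unused placeholders.
    """
    final = scores[-1]
    return sum(final) / len(final)


def average_forgetting(scores: List[List[float]]) -> float:
    """Mean drop from each task's best earlier score to its final score.

    The last task is excluded because no later stage could have
    overwritten it yet.
    """
    num_stages = len(scores)
    if num_stages < 2:
        return 0.0
    final = scores[-1]
    drops = []
    for j in range(num_stages - 1):
        # Best score on task j at any stage from when it was learned
        # up to, but not including, the final stage.
        best_earlier = max(scores[i][j] for i in range(j, num_stages - 1))
        drops.append(best_earlier - final[j])
    return sum(drops) / len(drops)


# Illustrative numbers only: three tasks learned one after another.
example_scores = [
    [0.90, 0.00, 0.00],  # after stage 1
    [0.70, 0.80, 0.00],  # after stage 2
    [0.60, 0.75, 0.85],  # after stage 3
]
print(final_average(example_scores))
print(average_forgetting(example_scores))
```

A positive forgetting value means earlier tasks degraded as later ones were learned. Retention of generality and cross-task generalization need separate held-out prompt sets and are not captured by this matrix.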

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 6·Confidence 3

Strengths

- Continual post-training is practically important but lacks standardized evaluation. This benchmark provides much-needed infrastructure for fair comparison and reproducible research.
- The four-dimensional assessment is thorough and well-motivated. This holistic view goes beyond typical continual learning benchmarks.
- Testing whether models can compose concepts from different tasks (e.g., "astronaut riding horse" after learning "astronaut" and "horse riding" separately) is creative and importa

Weaknesses

- The benchmark includes only 4 items and 2 domains, which is insufficient for evaluating continual learning at scale. Real-world scenarios involve many more tasks. Expanding to at least 8-10 items and 4-5 domains would significantly strengthen the benchmark's validity and challenge methods more realistically. The task types are also limited to customization and enhancement; other important scenarios, such as style transfer or concept editing, are missing.
- Domain enhancement relies entirely on

Reviewer 02·Rating 4·Confidence 3

Strengths

1. This work develops an automated evaluation pipeline to assess preservation of pretrained generality, target-task performance, forgetting, and cross-task generalization for continual T2I post-training.
2. Building upon the proposed T2I-ConBench, this paper evaluates ten representative baseline methods on mixed order streams.

Weaknesses

1. In Table 2, it is surprising that the replay method yields zero performance for Unique-Sim in Order 2 training. Is there any insight regarding this?
2. The findings summarized in the work do not offer particularly novel or compelling insights, as most of them are straightforward or were previously pointed out.

Reviewer 03·Rating 6·Confidence 3

Strengths

1. A new and comprehensive post-training evaluation framework for continuous learning diffusion models is necessary.
2. The proposed evaluation aspects are diverse, adding generality retention and cross-task generalization to the forgetting aspect, which is the primary focus of continuous learning, thus meeting current requirements for AI.

Weaknesses

1. The dataset seems to have some limitations in size and diversity; for example, the tasks only include dogs, cats, and sneakers. This may lead to shortcomings in evaluation.

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis

Methods: Diffusion