CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration
Xiefan Guo, Xinzhu Ma, Haiyu Zhang, Di Huang

TL;DR
This paper introduces Cross-Timestep Self-Calibration (CTCal), a method that improves text-image alignment in diffusion models by using the reliable cross-attention maps formed at small, low-noise timesteps to guide training at larger, noisier timesteps.
Contribution
The paper proposes a novel calibration technique that enhances text-to-image diffusion models' alignment accuracy by explicitly supervising representations across timesteps, applicable to various existing models.
Findings
Improved text-image alignment on benchmarks
Enhanced model generalization and robustness
Seamless integration with existing diffusion models
Abstract
Recent advancements in text-to-image synthesis have been largely propelled by diffusion-based models, yet achieving precise alignment between text prompts and generated images remains a persistent challenge. We find that this difficulty arises primarily from the limitations of conventional diffusion loss, which provides only implicit supervision for modeling fine-grained text-image correspondence. In this paper, we introduce Cross-Timestep Self-Calibration (CTCal), founded on the supporting observation that establishing accurate text-image alignment within diffusion models becomes progressively more difficult as the timestep increases. CTCal leverages the reliable text-image alignment (i.e., cross-attention maps) formed at smaller timesteps with less noise to calibrate the representation learning at larger timesteps with more noise, thereby providing explicit supervision during…
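The calibration idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the linear timestep weighting, the mean-squared-error objective, and the use of Gaussian feature perturbation as a stand-in for diffusion noise are all assumptions made for illustration. It only shows the general shape of using a low-noise attention map as a fixed target for a high-noise one.

```python
import math
import random

def softmax(row):
    # Numerically stable softmax over one list of logits.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(image_feats, text_feats):
    # One attention distribution over text tokens per image location.
    d = len(text_feats[0])
    maps = []
    for q in image_feats:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in text_feats]
        maps.append(softmax(logits))
    return maps

def calibration_loss(student_maps, teacher_maps, t_student, num_timesteps):
    # Timestep-weighted MSE. The teacher maps are treated as a fixed target;
    # in a real training framework they would be detached from the graph.
    sq = [(s - t) ** 2
          for s_row, t_row in zip(student_maps, teacher_maps)
          for s, t in zip(s_row, t_row)]
    weight = t_student / num_timesteps  # assumed linear weighting
    return weight * sum(sq) / len(sq)

def add_noise(feats, sigma, rng):
    # Hypothetical stand-in for the forward diffusion process.
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in feats]

rng = random.Random(0)
text = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]    # 4 tokens
image = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]  # 16 locations

teacher = cross_attention(add_noise(image, 0.05, rng), text)  # small timestep
student = cross_attention(add_noise(image, 1.0, rng), text)   # large timestep
loss = calibration_loss(student, teacher, t_student=800, num_timesteps=1000)
```

In this sketch the loss is zero when the two maps coincide and grows with the assumed weight as the student timestep increases, so noisier timesteps receive a stronger calibration signal.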
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
1. **Clear problem framing:** Directly targets cross-timestep misalignment. 2. **Simple supervision signal:** Reuses model-internal cross-attention as a self-supervised "teacher". 3. **Comprehensive design:** Pixel + semantic attention alignment and subject balancing are sensible; timestep-aware weighting matches the difficulty profile. 4. **Empirical gains:** Consistent improvements on compositional/prompt-following metrics.
1. **Diversity risk from attention supervision.** Using small-timestep attention to shape large-timestep behavior might bias the model toward more deterministic layouts and reduce output diversity. I wonder whether generation shows a drop in diversity or signs of mode collapse. A metric analysis or a visualization may help. 2. **Dataset construction and generalization.** Training data is curated from T2I-CompBench-like prompts via reward-driven selection. This raises concerns about overfit
1. While many prior works focus on improving image-text alignment during inference, it's interesting to see this paper instead provide explicit supervision for modeling fine-grained text-image correspondence during training. 2. The main idea of Cross-Timestep Self-Calibration is novel. It moves beyond conventional losses by introducing a self-supervised signal derived internally from the model's behavior at different levels of noise.
1. The method seems to add computational complexity, and the qualitative results do not seem strong enough to suggest the utility of the proposed approach. For example, in the first half of Figure 4, the jar in the 5th column is just floating in the air, the banana in the 6th column looks unnatural, and green leaks onto the banana. 2. The authors choose $t_{tea}=0$ in the final setup, but use $t_{tea}=1$ while motivating the overall approach in Figure 1. I wonder whether $t_{tea}=0$ would gi
1. Targeting a validated issue: It addresses the measurable degradation of text-image alignment with increasing diffusion timesteps, supported by cross-attention map visualizations. 2. Model agnosticism: It seamlessly integrates into diverse text-to-image diffusion models, including diffusion-based (e.g., SD 2.1) and flow-based (e.g., SD 3) approaches. 3. Comprehensive validation: It is rigorously tested on T2I-CompBench++/GenEval benchmarks and user studies, with no trade-offs in image dive
1. Limited novelty. It is essentially an integration of existing techniques rather than a breakthrough: using cross-attention for alignment, filtering non-semantic tokens, and combining multiple loss terms are all well explored in prior work on diffusion model optimization. Token mapping in the attention map has also been well explored in previous works, either during inference or during training. For example, Dreamo[1] explores routing constraints in DiT structure to distinguish multiple s
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Machine Learning in Materials Science
