InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models
Sarah Rastegar, Violeta Chatalbasheva, Sieger Falkena, Anuj Singh, Yanbo Wang, Tejas Gokhale, Hamid Palangi, Hadi Jamali-Rad

TL;DR
InfSplign is a training-free, inference-time method that enhances spatial accuracy in text-to-image diffusion models by adjusting noise based on cross-attention maps, outperforming existing baselines.
Contribution
We introduce InfSplign, a lightweight, plug-and-play inference-time approach that improves spatial alignment without additional training or fine-tuning.
Findings
Achieves state-of-the-art spatial alignment on VISOR and T2I-CompBench.
Outperforms existing inference-time baselines and fine-tuning methods.
Compatible with any diffusion backbone.
Abstract
Text-to-image (T2I) diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. This limitation can be traced to two factors: a lack of fine-grained spatial supervision in training data and the inability of text embeddings to encode spatial semantics. We introduce InfSplign, a training-free inference-time method that improves spatial alignment by adjusting the noise through a compound loss in every denoising step. The proposed loss leverages different levels of cross-attention maps extracted from the backbone decoder to enforce accurate object placement and a balanced object presence during sampling. The method is lightweight, plug-and-play, and compatible with any diffusion backbone. Our comprehensive evaluations on VISOR and T2I-CompBench show that InfSplign establishes a new state-of-the-art (to the best of our knowledge), achieving…
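As a rough illustration of the kind of attention-space statistics the abstract alludes to, the sketch below computes a centroid and a spread for a single 2D attention map in plain Python. The function name `attention_stats`, the use of pixel-index coordinates, and the reading of "variance" as attention-weighted spatial spread around the centroid are assumptions made for illustration; the abstract does not specify them.

```python
def attention_stats(attn):
    """Return (centroid_x, centroid_y, spread) for a 2D non-negative map.

    `attn` is a list of rows. Coordinates are pixel indices, with x running
    along a row and y across rows. The spread is the attention-weighted mean
    squared distance to the centroid (an assumed definition of "variance").
    """
    total = sum(sum(row) for row in attn)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")

    # Attention-weighted mean position along each axis.
    cx = sum(x * v for row in attn for x, v in enumerate(row)) / total
    cy = sum(y * v for y, row in enumerate(attn) for v in row) / total

    # Attention-weighted mean squared distance from the centroid.
    spread = sum(
        ((x - cx) ** 2 + (y - cy) ** 2) * v
        for y, row in enumerate(attn)
        for x, v in enumerate(row)
    ) / total
    return cx, cy, spread
```

A map with all its mass on one pixel has zero spread, while a map split between two distant pixels has its centroid between them and a larger spread.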
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
+ The method is training-free and requires no extra inputs, making it easy to deploy and compatible with various diffusion backbones.
+ It achieves competitive results on standard benchmarks, significantly outperforming other inference-time methods.
+ The hierarchical use of attention maps and the introduction of variance as a measure of uncertainty are thoughtful design choices.
- The method introduces several hyperparameters (e.g., α, m, λs, λp, λb, η). While a grid search was conducted, their generalizability and robustness across different datasets or models are not fully verified, potentially affecting usability.
- The approach primarily focuses on binary spatial relationships between two objects. Its applicability to more complex spatial layouts or scenes with multiple objects is not deeply explored, limiting its scope.
- The experiments are conducted exclusively
- The proposed method is training-free and can be integrated into existing models as a plug-and-play module, making it a practical option for U-Net-based T2I diffusion models.
- The paper includes extensive ablation studies that thoroughly investigate the contributions of different components, particularly the loss hyperparameters.
- The experimental validation is limited to relatively older U-Net architectures (SD1.4, SD2.1). It is unclear whether the spatial alignment issues addressed persist in more recent, state-of-the-art models, and whether the proposed method remains effective on them.
- The method's design is heavily focused on the U-Net architecture, whereas the field has shifted to transformer-based backbones such as DiT and MMDiT (e.g., in SD3). These newer architectures employ different attention mechanisms (e.g., joint at
- The method is simple, effective, and plug-and-play. Since the method doesn't require any retraining, it can be applied to U-Net-based models.
- Spatial placement (centroid margin), object dropping (variance minimization), and overshadowing (variance parity) map cleanly onto observed issues.
- VISOR and T2I-CompBench numbers are consistently higher than both inference-time and fine-tuning-based baselines.
- The spatial loss encodes a fixed set of relations (left/right/above/below/near) via axis-wise centroid differences with a margin. This is likely tuned to VISOR and does not cover richer or contextual relations (between, around, on, behind/in-front-of).
- The method assumes the caption can be parsed cleanly into ⟨A,R,B⟩ and only enforces a single binary relation at a time. The paper does not study multi-object scenes with multiple simultaneous constraints, where losses could conflict.
- The con
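The "axis-wise centroid differences with a margin" that this review describes can be pictured with a small hinge-style penalty. This is only a sketch under stated assumptions: the hinge form, the function name `spatial_margin_loss`, the default margin of 0.25, and the use of centroids normalized to [0, 1] with y growing downward are all illustrative choices, not the paper's formulation. The "near" relation is left out because the review does not say how it is encoded.

```python
def spatial_margin_loss(centroid_a, centroid_b, relation, margin=0.25):
    """Hinge penalty that is zero once A sits `margin` past B on the right axis.

    Centroids are (x, y) pairs, assumed normalized to [0, 1], with x growing
    rightward and y growing downward (image convention).
    """
    ax, ay = centroid_a
    bx, by = centroid_b

    # Signed separation along the axis implied by the relation; it is
    # positive when the relation "A <relation> B" already holds.
    if relation == "left":
        separation = bx - ax
    elif relation == "right":
        separation = ax - bx
    elif relation == "above":
        separation = by - ay
    elif relation == "below":
        separation = ay - by
    else:
        raise ValueError(f"unsupported relation: {relation!r}")

    # No penalty once the separation exceeds the margin.
    return max(0.0, margin - separation)
```

For example, with A at (0.25, 0.5) and B at (0.75, 0.5), the relation "left" incurs no penalty, while "right" is penalized because A would have to move past B by the margin.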
- The formulation is intuitive and mathematically grounded in attention-space statistics.
- Outperforms both inference-time and fine-tuning methods on major spatial reasoning benchmarks.
- Ablation results demonstrate that all three loss components contribute meaningfully.
- The paper lacks a convincing justification that cross-attention variance is a valid proxy for object presence and representation balance; while the method computes centroids and variances from decoder cross-attention, it does not establish (theoretically or empirically) that lower variance reliably implies preserved objects or balanced representations, which calls for direct evidence.
- The paper lacks a clean isolation of spatial-alignment gains relative to STORM; results are reported only fo
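To make the reviewers' vocabulary concrete (variance minimization for object presence, variance parity for balance, and a weighted compound loss), the following sketch combines three scalar terms. Everything here is an assumption for illustration: the function names, the specific forms (sum of variances, absolute variance gap), and the mapping of the weights `lambda_s`, `lambda_p`, `lambda_b` to the spatial, presence, and balance terms are inferred from the symbols the reviews mention, not taken from the paper.

```python
def presence_term(var_a, var_b):
    """Assumed 'variance minimization': penalize diffuse attention on either object."""
    return var_a + var_b


def balance_term(var_a, var_b):
    """Assumed 'variance parity': penalize one object's map being much more diffuse."""
    return abs(var_a - var_b)


def compound_loss(spatial, var_a, var_b, lambda_s=1.0, lambda_p=1.0, lambda_b=1.0):
    """Weighted sum of a spatial term and the two variance-based terms.

    `spatial` is a precomputed scalar placement penalty; `var_a` and `var_b`
    are per-object attention variances. The weights are hypothetical defaults.
    """
    return (
        lambda_s * spatial
        + lambda_p * presence_term(var_a, var_b)
        + lambda_b * balance_term(var_a, var_b)
    )
```

Setting a weight to zero switches off the corresponding term, which is one simple way to mimic the per-component ablations the reviews refer to.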
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
