ISAC: Training-Free Instance-to-Semantic Attention Control for Improving Multi-Instance Generation
Sanghyun Jo, Wooyeol Lee, Ziseok Lee, Kyungsu Kim

TL;DR
This paper introduces ISAC, a training-free, model-agnostic method that improves multi-object image generation by controlling instance boundaries and semantics through hierarchical attention, leading to more accurate and consistent multi-instance outputs.
Contribution
ISAC is a novel, training-free approach that enhances multi-instance image generation by hierarchical attention control, improving instance accuracy without additional training or external models.
Findings
Achieves at least 50% improvement in multi-class accuracy on IntraCompBench.
Improves multi-instance accuracy by 7% on IntraCompBench.
Strengthens layout-to-image controllers with refined dense instance masks.
Abstract
Text-to-image diffusion models have recently become highly capable, yet their behavior in multi-object scenes remains unreliable: models often produce an incorrect number of instances and exhibit semantics leaking across objects. We trace these failures to vague instance boundaries; self-attention already reveals instance layouts early in the denoising process, but existing approaches act only on semantic signals. We introduce ISAC (Instance-to-Semantic Attention Control), a training-free, model-agnostic objective that performs hierarchical attention control by first carving out instance layouts from self-attention and then binding semantics to these instances. In Phase 1, ISAC clusters self-attention into the number of instances and repels overlaps, establishing an instance-level structural hierarchy; in Phase 2, it injects these…
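The clustering step of Phase 1 can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: it assumes each pixel is described by its row of a self-attention map and groups the rows into K instances with plain K-means (the clustering algorithm named in one of the reviews below). The helper names `sq_dist` and `kmeans`, the farthest-point initialisation, and the 6-pixel attention map are all hypothetical; the overlap-repelling term and the latent optimization are not reproduced here.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=10):
    """Plain Lloyd's k-means; returns one cluster label per point.

    Farthest-point initialisation is an assumption made here so that the
    toy example is deterministic.
    """
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))

    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical self-attention map over 6 pixels: pixels 0-2 attend mostly
# to each other, and so do pixels 3-5 (two instances).
attn = [
    [0.30, 0.30, 0.30, 0.03, 0.03, 0.04],
    [0.30, 0.30, 0.30, 0.04, 0.03, 0.03],
    [0.30, 0.30, 0.30, 0.03, 0.04, 0.03],
    [0.03, 0.03, 0.04, 0.30, 0.30, 0.30],
    [0.04, 0.03, 0.03, 0.30, 0.30, 0.30],
    [0.03, 0.04, 0.03, 0.30, 0.30, 0.30],
]

num_instances = 2  # taken from the prompt, e.g. "two cats"
labels = kmeans(attn, num_instances)

# One binary mask per instance, derived from the cluster labels.
masks = [[1 if lab == c else 0 for lab in labels] for c in range(num_instances)]
print(labels)
```

In this toy setting the two groups of pixels end up in separate clusters, giving one binary mask per requested instance. A real attention map would be far larger and noisier.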
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The proposed approach effectively addresses common failures in multi-object generation by decoupling instance formation from semantic binding, enabling more accurate and coherent image synthesis in complex scenes. The proposed method is model-agnostic and can be complementary to fine-tuned models and also enhance existing layout-guided models. The paper is well-structured, and the supplementary material contributes meaningfully to both understanding and reproducibility of the proposed approach.
While the paper introduces some interesting ideas, its contribution is somewhat limited by the fact that object count and layout consistency in multi-object image generation are already being actively explored in the literature. Given the growing number of existing approaches, the novelty of the proposed method appears incremental, and broader comparisons with related works could further clarify its distinct advantages. The conclusion section is relatively underdeveloped; it lacks a thorough ar
1. Based on the observation that in the denoising process of diffusion models, spatial instance structures emerge before clear semantics materialize, the proposed stage-wise separation algorithm is more reasonable and better aligned with this generative process compared with prior work. 2. The effectiveness of the method is validated across multiple popular text-to-image models (e.g., SD1, SD2, SD3, SDXL, PixArt), showing consistent improvements in multi-instance generation quality.
1. Compared with the original approach, the proposed method may introduce additional computational overhead. The inclusion of latent optimization and VLM models requires larger VRAM and increases inference time. 2. The work lacks comparative experiments with LLM + Layout methods.
The proposed method works across UNet and Diffusion Transformer models and can be combined with layout-guided methods. It introduces a new Maximum Pixel-wise Overlap (MPO) criterion to enforce mutually exclusive boundaries between instances, based on the observation that instance structures form at early diffusion timesteps.
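The idea behind a maximum pixel-wise overlap criterion can be sketched in a few lines. This summary does not give the paper's exact definition, so the following is an assumption: the overlap at a pixel is taken to be the second-largest soft-mask value there (equivalently, the largest pairwise minimum over instance masks), and the criterion is the maximum of that quantity over all pixels. The function name `max_pixelwise_overlap` and the example masks are hypothetical.

```python
def max_pixelwise_overlap(masks):
    """Worst-case overlap between K soft instance masks.

    masks: list of K flat lists with values in [0, 1], one value per pixel.
    Assumed definition (not taken from the paper): the overlap at a pixel is
    the second-largest mask value there; the result is the maximum over pixels.
    """
    if len(masks) < 2:
        return 0.0
    worst = 0.0
    for values in zip(*masks):  # iterate pixel by pixel across all masks
        second_largest = sorted(values, reverse=True)[1]
        worst = max(worst, second_largest)
    return worst


# Mutually exclusive masks: no pixel is claimed by two instances.
exclusive = [[1.0, 1.0, 0.0, 0.0],
             [0.0, 0.0, 1.0, 1.0]]

# Leaky masks: the third pixel is shared between both instances.
leaky = [[0.9, 0.8, 0.4, 0.0],
         [0.1, 0.2, 0.6, 1.0]]

print(max_pixelwise_overlap(exclusive))  # 0.0
print(max_pixelwise_overlap(leaky))      # 0.4
```

Under this reading, driving the value toward zero pushes every pixel to belong to at most one instance, which matches the "mutually exclusive boundaries" the reviewer describes. Because the maximum focuses on the single worst pixel, it penalizes even a small merged region that an average-overlap measure could hide.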
The evaluation of the proposed method is insufficient, contrary to the title and the claims made in the paper. It needs to be evaluated against other similar instance generation methods that can handle more complex tasks, such as MIGC and InstanceDiffusion, since the proposed method can be combined with those. The multi-instance evaluation setup is weak, as it only uses instances of the same class and does not evaluate multiple instances of two classes, such as “A photo of two dogs and three cats”. In L21
1. Methodological novelty: ISAC introduces a dynamic-aware two-phase objective that decouples instance structure formation and semantic binding, effectively addressing missing or merged instances in multi-instance generation. 2. Technical completeness: The method includes global foreground mask creation, K-means clustering for instance structure, Maximum Pixel-wise Overlap (MPO) for strict separation, and semantic binding, making it a comprehensive and practical approach. It can be applied as a
1. The description of the method in the paper needs to be clearer. 2. The examples presented in the paper are relatively limited; they are mostly cats and dogs, with only a few examples of other categories (SD1.5) shown in the appendix. 3. This training-free approach generally tends to degrade image quality, and the method lacks evaluation of image quality. 4. The method introduces significant time and memory overhead, increasing inference memory from 23GB to 75GB and substantially prolonging
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Reinforcement Learning in Robotics
