AnyMS: Bottom-up Attention Decoupling for Layout-guided and Training-free Multi-subject Customization
Binhe Yu, Zhen Wang, Kexin Li, Yuqian Yuan, Wenqiao Zhang, Long Chen, Juncheng Li, Jun Xiao, Yueting Zhuang

TL;DR
AnyMS is a training-free framework that uses bottom-up attention decoupling to effectively synthesize multi-subject images guided by layout, text, and subject images, balancing alignment, identity, and layout control.
Contribution
It introduces a novel bottom-up dual-level attention decoupling mechanism for layout-guided multi-subject customization without additional training.
Findings
Achieves state-of-the-art performance in multi-subject image synthesis.
Supports complex compositions with multiple subjects.
Operates without subject-specific training or tuning.
Abstract
Multi-subject customization aims to synthesize multiple user-specified subjects into a coherent image. To address issues such as missing or conflicting subjects, recent works incorporate layout guidance to provide explicit spatial constraints. However, existing methods still struggle to balance three critical objectives: text alignment, subject identity preservation, and layout control, while the reliance on additional training further limits their scalability and efficiency. In this paper, we present AnyMS, a novel training-free framework for layout-guided multi-subject customization. AnyMS leverages three input conditions: text prompt, subject images, and layout constraints, and introduces a bottom-up dual-level attention decoupling mechanism to harmonize their integration during generation. Specifically, global decoupling separates cross-attention between textual and visual conditions…
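The abstract is cut off before it details the mechanism, so the following is only a rough illustration of the general idea, not the authors' implementation. It assumes an IP-Adapter-style design (one reviewer below makes that comparison): text tokens and subject-image tokens get separate cross-attention branches, and each subject's branch is added only inside that subject's layout box (per-box masking, as another reviewer describes it). All names (`attention`, `box_mask`, `decoupled_cross_attention`, `image_scale`) are hypothetical, and the additive combination rule is an assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on plain lists (one output row per query)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


def box_mask(height, width, box):
    """Flattened 0/1 mask for box = (x0, y0, x1, y1), half-open, in grid cells."""
    x0, y0, x1, y1 = box
    return [
        1.0 if (x0 <= x < x1 and y0 <= y < y1) else 0.0
        for y in range(height)
        for x in range(width)
    ]


def decoupled_cross_attention(queries, text_kv, subject_kvs, masks, image_scale=1.0):
    """Text branch everywhere, plus one image branch per subject inside its box.

    Assumption: branches are combined by simple addition; the paper's actual
    combination rule is not given in the truncated abstract.
    """
    text_keys, text_values = text_kv
    out = attention(queries, text_keys, text_values)  # text attends at all positions
    for (keys, values), mask in zip(subject_kvs, masks):
        subject_out = attention(queries, keys, values)  # separate image branch
        for i, m in enumerate(mask):
            if m:  # only positions inside this subject's box are modified
                out[i] = [o + image_scale * m * s for o, s in zip(out[i], subject_out[i])]
    return out


# Toy usage: a 2x2 latent grid (4 query positions), two text tokens, two subjects.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
text_kv = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
subject_a = ([[1.0, 1.0]], [[10.0, 0.0]])   # one image token for subject A
subject_b = ([[0.0, 1.0]], [[0.0, 20.0]])   # one image token for subject B
masks = [box_mask(2, 2, (0, 0, 1, 2)), box_mask(2, 2, (1, 0, 2, 2))]  # left / right column
result = decoupled_cross_attention(queries, text_kv, [subject_a, subject_b], masks)
print(len(result), len(result[0]))
```

In this sketch, positions outside every box receive only the text branch, which is one simple way to keep untouched regions stable while each subject's features stay confined to its own box.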
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper is clearly written and easy to follow. 2. The figures and tables are clear and easy to understand. 3. The authors use many formulas to help readers better understand the concepts.
1. The motivation for the global decoupling design is insufficient. In lines 246-251, the authors claim that concatenating the subject token with the text prompt leads to various issues. However, this claim is unpersuasive because the root causes of these issues are not adequately analyzed (e.g., through attention maps). This further undermines the motivation for the global decoupling design. If I understand correctly, this component is essentially an IP-Adapter, which reinforces the concern about the lac
- Layout-guided, training-free multi-subject customization is an important task.
- The writing is clear overall and easy to follow.
- Incremental contribution: The work belongs to a line of research that includes numerous recent studies on text alignment, ID preservation, and layout control, specifically training-free methods based on attention manipulation. Given these existing works, the novelty appears incremental.
- Questionable trade-off between tasks: The assumption that text alignment, subject identity preservation, and layout control inherently require a performance trade-off is deb
1. The results for flexibly composing a varying number of subjects are very impressive. 2. The core innovations and contributions are presented clearly and effectively.
1. The core contribution of the paper is the "local decoupling" attention mechanism. However, the experimental section lacks a detailed analysis of it. Specifically, the paper would be strengthened by a comparison with other methods that also focus on improving attention mechanisms for customized generation, as well as visualizations of the proposed attention maps to demonstrate its effectiveness. 2. There appears to be a significant overlap in the information presented in Figure 4 and Table 1,
Overall, the paper has a clear and practical focus on layout-guided multi-subject customization. The design goes straight at the usual failure modes: identity leakage, boundary spillover, and layout drift. The dual-level attention decoupling runs only at inference and, paired with a pretrained image adapter, gives a training-free pipeline that keeps untouched regions stable while enforcing per-box control. Empirically, it improves layout control, identity preservation, and text alignment on comp
1. There is little analysis of noisy or overlapping boxes, adapter or depth feature noise, conflict resolution when boxes intersect, scaling beyond five subjects, and the runtime or memory cost at common resolutions. 2. The method reads as an assembly of known pieces such as cross attention control, per box masking, and image adapters. The paper does not clearly isolate a new technical insight beyond this integration. 3. Results rely solely on detector/CLIP-based automatic metrics; without a use
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Domain Adaptation and Few-Shot Learning
