AnyMS: Bottom-up Attention Decoupling for Layout-guided and Training-free Multi-subject Customization

Binhe Yu; Zhen Wang; Kexin Li; Yuqian Yuan; Wenqiao Zhang; Long Chen; Juncheng Li; Jun Xiao; Yueting Zhuang

arXiv:2512.23537 · cs.CV · January 5, 2026

Open Access · 4 Reviews

TL;DR

AnyMS is a training-free framework that uses bottom-up attention decoupling to effectively synthesize multi-subject images guided by layout, text, and subject images, balancing alignment, identity, and layout control.

Contribution

It introduces a novel bottom-up dual-level attention decoupling mechanism for layout-guided multi-subject customization without additional training.

Findings

01

Achieves state-of-the-art performance in multi-subject image synthesis.

02

Supports complex compositions with multiple subjects.

03

Operates without subject-specific training or tuning.

Abstract

Multi-subject customization aims to synthesize multiple user-specified subjects into a coherent image. To address issues such as missing or conflicting subjects, recent works incorporate layout guidance to provide explicit spatial constraints. However, existing methods still struggle to balance three critical objectives: text alignment, subject identity preservation, and layout control. Their reliance on additional training further limits scalability and efficiency. In this paper, we present AnyMS, a novel training-free framework for layout-guided multi-subject customization. AnyMS leverages three input conditions (text prompt, subject images, and layout constraints) and introduces a bottom-up dual-level attention decoupling mechanism to harmonize their integration during generation. Specifically, global decoupling separates cross-attention between textual and visual conditions…
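As a rough illustration of what separating cross-attention between textual and visual conditions can look like, the sketch below computes one attention branch for text tokens and a separate branch per subject image, then sums them. This follows the general IP-Adapter-style decoupled cross-attention pattern that Reviewer 01 compares the component to; it is not the AnyMS implementation. Restricting each subject branch to its layout box, and every name here (`box_mask`, `decoupled_cross_attention`, `scale`), are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: q is (n_query, d), k and v are (n_key, d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def box_mask(h, w, box):
    # Hypothetical helper: box = (x0, y0, x1, y1) in normalized [0, 1] coordinates.
    # Returns a flat (h*w,) mask that is 1 inside the box and 0 outside.
    x0, y0, x1, y1 = box
    mask = np.zeros((h, w))
    mask[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = 1.0
    return mask.reshape(-1)

def decoupled_cross_attention(q, text_kv, subject_kvs, boxes, h, w, scale=1.0):
    # Text branch: image queries attend to text tokens only.
    k_text, v_text = text_kv
    out = attention(q, k_text, v_text)
    # Visual branches: one per subject, kept separate from the text branch.
    # ASSUMPTION: each subject's contribution is limited to its layout box.
    for (k_sub, v_sub), box in zip(subject_kvs, boxes):
        mask = box_mask(h, w, box)[:, None]
        out = out + scale * mask * attention(q, k_sub, v_sub)
    return out
```

With a single subject whose box covers the left half of a 4×4 latent grid, the right-half positions receive exactly the text-only attention output, while the left-half positions also receive the subject branch.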

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The paper is clearly written and easy to follow.
2. The figures and tables are clear and easy to understand.
3. The authors use many formulas to help readers better understand the concepts.

Weaknesses

1. The motivation for the global decoupling design is insufficient. In lines 246-251, the authors claim that concatenating the subject token with the text prompt leads to various issues. However, this claim is unpersuasive because the root causes of these issues are not adequately analyzed (e.g., through attention maps), which further undermines the motivation for the global decoupling design. If I understand correctly, this component is essentially an IP-Adapter, which reinforces the concern about the lac

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- Layout-guided, training-free multi-subject customization is an important task.
- The writing is overall clear and easy to follow.

Weaknesses

- Incremental contribution: The work lies within a line of research that includes numerous recent studies on text alignment, ID preservation, and layout control. This work is particularly in the context of training-free methods based on attention manipulation. Thus, the novelty appears incremental given the existing works.
- Questionable trade-off between tasks: The assumption that text alignment, subject identity preservation, and layout control inherently require a performance trade-off is deb

Reviewer 03 · Rating 6 · Confidence 5

Strengths

1. The results for flexibly composing a varying number of subjects are very impressive.
2. The core innovations and contributions are presented clearly and effectively.

Weaknesses

1. The core contribution of the paper is the "local decoupling" attention mechanism. However, the experimental section lacks a detailed analysis of it. Specifically, the paper would be strengthened by a comparison with other methods that also focus on improving attention mechanisms for customized generation, as well as visualizations of the proposed attention maps to demonstrate its effectiveness.
2. There appears to be a significant overlap in the information presented in Figure 4 and Table 1,

Reviewer 04 · Rating 4 · Confidence 4

Strengths

Overall, the paper has a clear and practical focus on layout-guided multi-subject customization. The design directly targets the usual failure modes: identity leakage, boundary spillover, and layout drift. The dual-level attention decoupling runs only at inference and, paired with a pretrained image adapter, gives a training-free pipeline that keeps untouched regions stable while enforcing per-box control. Empirically, it improves layout control, identity preservation, and text alignment on comp

Weaknesses

1. There is little analysis of noisy or overlapping boxes, adapter or depth feature noise, conflict resolution when boxes intersect, scaling beyond five subjects, and the runtime or memory cost at common resolutions.
2. The method reads as an assembly of known pieces such as cross-attention control, per-box masking, and image adapters. The paper does not clearly isolate a new technical insight beyond this integration.
3. Results rely solely on detector/CLIP-based automatic metrics; without a use

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Domain Adaptation and Few-Shot Learning