PICS: Pairwise Image Compositing with Spatial Interactions
Hang Zhou, Xinxin Zuo, Sen Wang, Li Cheng

TL;DR
PICS introduces a self-supervised, interaction-aware image compositing method that models spatial relations and object interactions to improve the quality and stability of pairwise image edits, outperforming existing approaches.
Contribution
The paper proposes a novel Interaction Transformer with mask-guided Mixture-of-Experts and adaptive blending for improved spatial coherence in image compositing.
Findings
Superior compositing quality over state-of-the-art methods
Enhanced stability in pairwise image edits
Effective handling of geometric variations
Abstract
Despite strong single-turn performance, diffusion-based image compositing often struggles to preserve coherent spatial relations in pairwise or sequential edits, where subsequent insertions may overwrite previously generated content and disrupt physical consistency. We introduce PICS, a self-supervised composition-by-decomposition paradigm that composes objects in parallel while explicitly modeling the compositional interactions among (fully-/partially-)visible objects and background. At its core, an Interaction Transformer employs mask-guided Mixture-of-Experts to route background, exclusive, and overlap regions to dedicated experts, with an adaptive α-blending strategy that infers a compatibility-aware fusion of overlapping objects while preserving boundary fidelity. To further enhance robustness to geometric variations, we incorporate geometry-aware augmentations covering both…
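The region routing and blending described in the abstract can be illustrated with a small pixel-space sketch. This is not the authors' implementation: PICS routes regions to learned experts inside a transformer and infers the blending weight itself, whereas the sketch below only shows how two object masks partition a canvas into background, exclusive, and overlap regions, and how a per-pixel α would mix the two objects where they overlap. The function names `partition_regions` and `composite`, the array layout (H×W masks, H×W×C images), and the externally supplied `alpha` map are all assumptions made for illustration.

```python
import numpy as np


def partition_regions(mask_a, mask_b):
    """Split the canvas into background, exclusive, and overlap regions.

    mask_a, mask_b: (H, W) arrays, nonzero where each object is present.
    Returns a dict of boolean (H, W) masks that are mutually exclusive
    and together cover every pixel.
    """
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    return {
        "background": ~(a | b),   # neither object
        "exclusive_a": a & ~b,    # only object A
        "exclusive_b": b & ~a,    # only object B
        "overlap": a & b,         # both objects
    }


def composite(background, obj_a, obj_b, mask_a, mask_b, alpha):
    """Compose two objects onto a background in one pass (illustrative only).

    background, obj_a, obj_b: (H, W, C) float images.
    alpha: (H, W) weights in [0, 1]; here it is an input, standing in for
           the value a model would predict. alpha=1 shows A in the overlap.
    """
    regions = partition_regions(mask_a, mask_b)
    out = background.astype(np.float64).copy()

    # Exclusive regions take their object's pixels directly.
    out[regions["exclusive_a"]] = obj_a[regions["exclusive_a"]]
    out[regions["exclusive_b"]] = obj_b[regions["exclusive_b"]]

    # Overlap region: per-pixel convex combination of the two objects.
    ov = regions["overlap"]
    w = alpha[ov][:, None]  # shape (N, 1), broadcasts over channels
    out[ov] = w * obj_a[ov] + (1.0 - w) * obj_b[ov]
    return out
```

Because both objects are placed in a single pass, neither insertion can overwrite the other outside the overlap region, which is the failure mode of sequential edits that the abstract describes.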
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper is well-written, and the proposed methodology is sound.
2. The approach is well-supported by sufficient experimental results, including both quantitative and qualitative comparisons.
3. The core idea of the Mask-guided Mixture-of-Experts (MoE) is interesting.
1. Dependency on Mask Quality: The method's performance relies on precise input masks. The paper lacks a sensitivity analysis on mask quality (e.g., comparing results from high-quality vs. coarse masks or different segmentation sources), making it difficult to assess its robustness to imperfect inputs.
2. Background-based Gating: The gating mechanism for occlusion (Eqs. 6-9) is primarily based on background-object similarity. This may lead to incorrect judgments when objects share similar textu
This paper offers a novel parallel compositing paradigm and MoE-based transformer that targets pairwise relations, addressing instability in multi-turn diffusion edits. The paper is well-written and easy-to-read. The experimental design is reasonable.
1. Despite superior metrics, visual fidelity remains unsatisfactory in some samples, with identity preservation issues such as altered stitching on the bag (2nd row, Figure 4), a distorted bottle label (4th row, Figure 4), and changed shoe patterns (last row, Figure 5), suggesting limitations in retaining fine details during interaction modeling.
2. The restriction to exactly two objects may limit the method's broader applicability. Extensions to more than 2 objects are not explored in the paper, le
The paper identifies a real issue, multi-turn compositing instability, and addresses it with a structured, parallel formulation. The task definition is well-motivated. The mask-guided MoE and α-blending are simple but intuitive mechanisms to ensure boundary consistency.
It seems that the method is explicitly limited to pairwise (two-object) compositing. Could PICS perform well on three-object or four-object compositing? Please provide several visual examples if available. The baselines are outdated. To the best of my knowledge, some open-source object insertion models can perform much better than the baselines you have chosen, such as FreeCompose [1], OmniPaint [2], and Insert Anything [3]. Could you provide a comparison with these models? The proposed mask-guided MoE
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Medical Image Segmentation Techniques
