Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment
Bac Nguyen, Yuhta Takida, Naoki Murata, Chieh-Hsin Lai, Toshimitsu Uesaka, Stefano Ermon, Yuki Mitsufuji

TL;DR
CODA enhances object-centric learning by reducing slot interference and explicitly aligning slots with image content, leading to improved object discovery and property prediction in complex scenes.
Contribution
The paper introduces register slots and a contrastive alignment loss to improve slot-image correspondence in diffusion-based object-centric learning.
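As a rough illustration of how extra register slots can absorb attention that would otherwise be forced onto object slots, the sketch below appends register keys to a plain softmax attention. This summary does not say where the registers enter the model, so that placement is an assumption. The helper names (`softmax`, `dot`, `attend`) and the toy vectors are hypothetical and are not taken from the paper.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, object_slots, register_slots=()):
    """Return (weights over object slots, attention mass taken by registers).

    Assumption: registers are simply extra keys in the same softmax.
    """
    keys = list(object_slots) + list(register_slots)
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    n_obj = len(object_slots)
    return weights[:n_obj], sum(weights[n_obj:])

# A query that matches neither object slot.
query = [1.0, 0.0]
object_slots = [[0.0, 1.0], [0.0, -1.0]]
register_slots = [[1.0, 0.0]]

# Without registers the softmax must spread all mass over the object slots.
obj_only, reg_mass_none = attend(query, object_slots)
# With a register, part of that mass is absorbed instead.
obj_with_reg, reg_mass = attend(query, object_slots, register_slots)
```

In this toy setting the object slots receive strictly less attention once a register is present, which is the kind of interference reduction the contribution describes.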
Findings
Significant improvement in object discovery metrics on the COCO dataset (+6.1% FG-ARI).
Enhanced property prediction and compositional image generation.
Efficient and scalable approach with negligible overhead.
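FG-ARI, the metric quoted in the findings, is the adjusted Rand index computed over foreground pixels only. A minimal sketch of that computation follows, assuming flat lists of per-pixel labels and a background label of 0. The function names and the label convention are illustrative assumptions, not the paper's evaluation code.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, pred_labels):
    # Pair-counting ARI from the contingency table of the two labelings.
    n = len(true_labels)
    if n < 2:
        return 1.0  # no pixel pairs to disagree on
    cells = Counter(zip(true_labels, pred_labels))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (sum_cells - expected) / (max_index - expected)

def fg_ari(true_labels, pred_labels, background=0):
    # Keep only pixels whose ground-truth label is not background.
    keep = [i for i, t in enumerate(true_labels) if t != background]
    return adjusted_rand_index(
        [true_labels[i] for i in keep],
        [pred_labels[i] for i in keep],
    )

true_mask = [0, 0, 1, 1, 2, 2]   # 0 = background, 1 and 2 = objects
good_pred = [5, 6, 7, 7, 8, 8]   # splits the background, but objects are right
merged_pred = [5, 5, 7, 7, 7, 7] # merges the two objects into one slot
```

Because background pixels are dropped, `good_pred` is not penalised for splitting the background, while `merged_pred` is penalised for merging the two objects.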
Abstract
Slot Attention (SA) with pretrained diffusion models has recently shown promise for object-centric learning (OCL), but suffers from slot entanglement and weak alignment between object slots and image content. We propose Contrastive Object-centric Diffusion Alignment (CODA), a simple extension that (i) employs register slots to absorb residual attention and reduce interference between object slots, and (ii) applies a contrastive alignment loss to explicitly encourage slot-image correspondence. The resulting training objective serves as a tractable surrogate for maximizing mutual information (MI) between slots and inputs, strengthening slot representation quality. On both synthetic (MOVi-C/E) and real-world datasets (VOC, COCO), CODA improves object discovery (e.g., +6.1% FG-ARI on COCO), property prediction, and compositional image generation over strong baselines. Register slots add…
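To make the link between a contrastive loss and mutual information concrete, here is a minimal sketch of the standard InfoNCE loss and the lower bound it gives, I(slots; image) ≥ log N − L_InfoNCE. This is a generic stand-in: the abstract does not give CODA's loss, and the peer reviews suggest the actual objective is defined through reconstructions with positive and negative slot pairs. The score matrices below are made-up toy values.

```python
import math

def info_nce(scores):
    """Mean InfoNCE loss.

    scores[i][j] is a similarity between slot set i and image j;
    matched (positive) pairs sit on the diagonal.
    """
    n = len(scores)
    loss = 0.0
    for i, row in enumerate(scores):
        m = max(row)
        log_norm = m + math.log(sum(math.exp(s - m) for s in row))
        loss += log_norm - row[i]  # -log softmax of the positive pair
    return loss / n

def mi_lower_bound(scores):
    # InfoNCE bound on mutual information, in nats.
    return math.log(len(scores)) - info_nce(scores)

uninformative = [[0.0, 0.0], [0.0, 0.0]]  # slots say nothing about the image
aligned = [[10.0, 0.0], [0.0, 10.0]]      # slots identify their own image

loss_uninformative = info_nce(uninformative)
bound_aligned = mi_lower_bound(aligned)
```

Lowering the loss raises the bound, which is the sense in which a contrastive objective can act as a tractable surrogate for maximising MI. The bound is capped at log N for a batch of N candidates.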
Peer Reviews
Decision: ICLR 2026 Poster
* (S1) The paper is very well written, clearly explains the shortcomings of existing methods, and motivates the proposed solutions. The methodology section is well explained.
* (S2) Though the ideas in the paper can be seen as a combination of several ideas from existing works (register tokens [1], negative guidance [2], contrastive learning of slots [3]), the paper creatively combines these ideas in the context of slot attention methods. This enhances the technical contributions of the paper.
* (
* (W1) Fairness of evaluation: A significant issue in the comparison with other methods is the use of DINOv2 and the 512x512 image size for the training method. Other methods, such as SPOT and DINOSAUR, use a DINOv1 model with an image size much smaller than 512x512. Thus, the gains of the proposed method cannot be attributed solely to the slot attention registers and contrastive loss.
* (W2) Ablations on the VOC dataset: Though I understand that performing the ablations on VOC is quicker, th
- In this era of foundation models, I always believe the community of OCL should move to more pre-trained models. This paper is a nice attempt at using DINO and SD models, and demonstrates effectiveness on large-scale real-world datasets.
- All three techniques in CODA are well-motivated and implemented. I appreciate the thorough ablations in the main paper and the Appendix. They help answer the effectiveness of each component very well.
- The analysis of mutual information (MI) is interesting a
I don't see big weaknesses in the paper. Some minor weaknesses:
1. Each component is not very novel, as they pre-exist in other areas. Though I don't view this as a big issue: combining them in a nice way and achieving strong results is also a good contribution.
2. Why not compare with GLASS in the experiments? It seems that the segmentation results are similar to theirs (by checking tables in their paper), but I believe the generation capability of CODA must be stronger. Please include this in
- The method reports state-of-the-art results on both unsupervised object segmentation and compositional generation benchmarks, though the fairness of the compositional generation evaluation is debatable (see the first and last points under Weaknesses).
- The idea is conceptually simple; it does not introduce any architectural changes. (Although new register tokens can cost in terms of runtime (mostly negligible I guess, as the paper states that it only costs 0.02%) and contrastive loss is actually 2x forwar
- The compositional generation evaluation is not entirely fair. The proposed method (CODA) is explicitly trained to reconstruct images given random slot combinations through its contrastive objective, which involves positive and negative slot pairs. This directly optimizes the model for composition-like reconstruction, so improved performance under this metric is somewhat expected. The qualitative results, however, are convincing and indicate stronger disentanglement than SlotAdapt, likely due t
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis
