Learning to Compose: Improving Object Centric Learning by Injecting Compositionality

Whie Jung; Jaehoon Yoo; Sungjin Ahn; Seunghoon Hong

arXiv:2405.00646 · cs.CV · November 11, 2025

TL;DR

This paper introduces a new objective for object-centric learning that explicitly promotes compositionality, leading to better object representations and robustness across different architectures.

Contribution

It proposes a novel compositionality-enforcing objective integrated with existing frameworks like slot attention, improving object representation learning.

Findings

01 · Enhanced object representation quality

02 · Improved robustness to architectural variations

03 · Consistent performance gains across experiments

Abstract

Learning compositional representations is a key aspect of object-centric learning, as it enables flexible systematic generalization and supports complex visual reasoning. However, most existing approaches rely on an auto-encoding objective, while compositionality is only implicitly imposed by architectural or algorithmic biases in the encoder. This misalignment between the auto-encoding objective and learning compositionality often results in a failure to capture meaningful object representations. In this study, we propose a novel objective that explicitly encourages compositionality of the representations. Built upon an existing object-centric learning framework (e.g., slot attention), our method incorporates an additional constraint: an arbitrary mixture of object representations from two images should be valid, which is enforced by maximizing the likelihood of the composite data. We demonstrate that…
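The constraint described in the abstract, that a mixture of object representations from two images should itself be valid, can be illustrated with a minimal sketch. This is not the authors' implementation: the per-slot coin-flip mixing rule, the function names `mix_slots` and `toy_total_loss`, and the `weight` parameter are all assumptions made for illustration, and the slots are plain Python lists standing in for encoder outputs.

```python
import random


def mix_slots(slots_a, slots_b, rng):
    """Build a composite slot set from two images.

    Assumed mixing rule (hypothetical): for each slot index, take the slot
    from image A or image B with equal probability.
    """
    if len(slots_a) != len(slots_b):
        raise ValueError("both images must yield the same number of slots")
    composite, sources = [], []
    for slot_a, slot_b in zip(slots_a, slots_b):
        take_a = rng.random() < 0.5
        composite.append(slot_a if take_a else slot_b)
        sources.append("a" if take_a else "b")
    return composite, sources


def toy_total_loss(recon_loss, composite_nll, weight=1.0):
    """Combine the auto-encoding loss with a composite-likelihood term.

    `composite_nll` stands for the negative log-likelihood of the image
    decoded from the mixed slots; `weight` is a hypothetical balancing factor.
    """
    return recon_loss + weight * composite_nll


# Toy usage: three 2-dimensional slots per image.
rng = random.Random(0)
slots_a = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
slots_b = [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]]
composite, sources = mix_slots(slots_a, slots_b, rng)
print(sources, composite)
```

In a real model, the composite slots would be passed through the decoder and scored by a generative prior, and the resulting term would be minimized alongside the reconstruction loss.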

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- **Evaluation**: Multiple types of experiments are conducted, including quantitative/qualitative unsupervised segmentation, ablation studies, and analysis of robustness. The approach is shown to be robust to factors such as number of slots, as well as settings of the encoder and decoder, aspects which slot attention methods are usually sensitive to.
- **Research Context**: The authors do a good job in providing the relevant research context as well as model’s preliminaries, pointing to the limi

Weaknesses

- **Synthetic Data & Scalability**: The experiments are performed over synthetic data only. I recommend exploring scalability to real-world data too. This is especially important since it is unclear to me whether such an approach would scale well to more diverse real-world data, where there are correlations between object occurrence as well as appearance. I don’t know for sure but it might be the case that for realistic data the loss could damage the model’s learning compared to standard auto-en

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 4

Strengths

The paper's methodology is well described with helpful figures, and the experiments are comprehensive. Notably, the experiments on parameter robustness in sec 5.2, as well as the qualitative results shown alongside experiments with slot mixing strategy in the appendices, provide a compelling case for the benefits of this method against other SoTA methods, beyond the mere improvement of performance.

Weaknesses

There are no major weaknesses with the methodology or evaluations. The presentation is clear in most places, but there are many grammatical mistakes. It is recommended that these be fixed through the use of automated grammar checkers or proof-reading by a native speaker.

Some minor notes on clarification:

* On the first reading, it was not entirely clear how the one-shot decoder and diffusion decoders were being trained (i.e. whether one, or both, were being used in the auto-encoding path) - th

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

The authors identify a weakness in the standard reconstruction objective of Slot Attention and similar papers: the reconstruction objective does not explicitly encourage compositionality of the learned representation. This observation calls for a generative objective that ensures that composition of slots can be decoded into realistic images. Instead of implementing this with a GAN discriminator, which would require an additional component, the authors follow the approach of Poole et al. (2022)

Weaknesses

In the proposed method, the number of losses, regularizers, and training tricks to balance is high. I would appreciate a more thorough ablation study where each component is isolated and tested. The combinations in Table 2 are valid but do not cover all possibilities.

I would appreciate additional evaluation metrics that are not based on segmentation. Learning a good object-centric representation means much more than achieving good segmentation. It is important to show that the learned slots ar

Code & Models

Repositories

whieya/learning-to-compose · jax · Official

Videos

Learning to Compose: Improving Object Centric Learning by Injecting Compositionality · slideslive

Taxonomy

Topics: Cognitive Science and Education Research · Robotics and Automated Systems · Innovative Teaching and Learning Methods