Learning to Compose: Improving Object Centric Learning by Injecting Compositionality
Whie Jung, Jaehoon Yoo, Sungjin Ahn, Seunghoon Hong

TL;DR
This paper introduces a new objective for object-centric learning that explicitly promotes compositionality, leading to better object representations and robustness across different architectures.
Contribution
It proposes a novel compositionality-enforcing objective integrated with existing frameworks like slot attention, improving object representation learning.
Findings
Enhanced object representation quality
Improved robustness to architectural variations
Consistent performance gains across experiments
Abstract
Learning compositional representations is a key aspect of object-centric learning, as it enables flexible systematic generalization and supports complex visual reasoning. However, most existing approaches rely on an auto-encoding objective, while compositionality is only implicitly imposed by the architectural or algorithmic bias in the encoder. This misalignment between the auto-encoding objective and learning compositionality often results in a failure to capture meaningful object representations. In this study, we propose a novel objective that explicitly encourages compositionality of the representations. Built upon the existing object-centric learning framework (e.g., slot attention), our method incorporates an additional constraint that an arbitrary mixture of object representations from two images should be valid, enforced by maximizing the likelihood of the composite data. We demonstrate that…
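The mixing step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `mix_slots`, the per-slot coin flip with probability 0.5, and the slot shapes are all assumptions made for illustration, and the paper's actual mixing strategy and likelihood objective may differ. The sketch only shows how a composite slot set could be built from the slots of two images before it is passed to a decoder.

```python
import numpy as np


def mix_slots(slots_a, slots_b, rng):
    """Build a composite slot set by taking each slot from image A or image B.

    Hypothetical helper: a per-slot coin flip is assumed here; the paper's
    exact mixing strategy is not given in the abstract.
    """
    if slots_a.shape != slots_b.shape:
        raise ValueError("both slot sets must have shape (num_slots, slot_dim)")
    num_slots = slots_a.shape[0]
    # True -> keep the slot from image A, False -> take it from image B.
    take_from_a = rng.random(num_slots) < 0.5
    mixed = np.where(take_from_a[:, None], slots_a, slots_b)
    return mixed, take_from_a


# Toy example: 4 slots of dimension 8 per image (illustrative sizes only).
rng = np.random.default_rng(0)
slots_a = rng.normal(size=(4, 8))
slots_b = rng.normal(size=(4, 8))
mixed, take_from_a = mix_slots(slots_a, slots_b, rng)
```

In a full training loop, `mixed` would be decoded into a composite image and scored by a generative prior, so that the encoder is rewarded when arbitrary slot combinations still decode to plausible images.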
Peer Reviews
Decision: ICLR 2024 poster
- **Evaluation**: Multiple types of experiments are conducted, including quantitative/qualitative unsupervised segmentation, ablation studies, and analysis of robustness. The approach is shown to be robust to factors such as number of slots, as well as settings of the encoder and decoder, aspects which slot attention methods are usually sensitive to.
- **Research Context**: The authors do a good job in providing the relevant research context as well as model’s preliminaries, pointing to the limi
- **Synthetic Data & Scalability**: The experiments are performed over synthetic data only. I recommend exploring scalability to real-world data too. This is especially important since it is unclear to me whether such an approach would scale well to more diverse real-world data, where there are correlations between object occurrence as well as appearance. I don’t know for sure but it might be the case that for realistic data the loss could damage the model’s learning compared to standard auto-en
The paper's methodology is well described with helpful figures, and the experiments are comprehensive. Notably, the experiments on parameter robustness in sec 5.2, as well as the qualitative results shown alongside experiments with slot mixing strategy in the appendices, provide a compelling case for the benefits of this method against other SoTA methods, beyond the mere improvement of performance.
There are no major weaknesses with the methodology or evaluations. The presentation is clear in most places, but there are many grammatical mistakes. It is recommended that these be fixed through the use of automated grammar checkers or proof-reading by a native speaker. Some minor notes on clarification:
* On the first reading, it was not entirely clear how the one-shot decoder and diffusion decoders were being trained (i.e. whether one, or both, were being used in the auto-encoding path) - th
The authors identify a weakness in the standard reconstruction objective of Slot Attention and similar papers: the reconstruction objective does not explicitly encourage compositionality of the learned representation. This observation calls for a generative objective that ensures that composition of slots can be decoded into realistic images. Instead of implementing this with a GAN discriminator, which would require an additional component, the authors follow the approach of Poole et al. (2022)
In the proposed method, the number of losses, regularizers, and training tricks to balance is high. I would appreciate a more thorough ablation study where each component is isolated and tested. The combinations in table 2 are valid but do not cover all possibilities. I would appreciate additional evaluation metrics that are not based on segmentation. Learning a good object-centric representation means much more than achieving good segmentation. It is important to show that the learned slots ar
Taxonomy
Topics: Cognitive Science and Education Research · Robotics and Automated Systems · Innovative Teaching and Learning Methods
