Are Object-Centric Representations Better At Compositional Generalization?

Ferdinand Kapl; Amir Mohammad Karimi Mamaghan; Maximilian Seitzer; Karl Henrik Johansson; Carsten Marr; Stefan Bauer; Andrea Dittadi

arXiv:2602.16689·cs.CV·February 19, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a new benchmark to evaluate how well object-centric and dense vision encoders generalize to unseen object combinations in visual question answering, revealing that object-centric models excel under data or compute constraints.

Contribution

It provides a systematic comparison of object-centric versus dense representations in compositional generalization across three visual worlds, accounting for various training factors.

Findings

1. Object-centric models outperform dense models in challenging generalization settings.
2. Dense representations are better on easier tasks but require more data and compute.
3. Object-centric models are more sample-efficient and perform better with limited data.

Abstract

Compositional generalization, the ability to reason about novel combinations of familiar concepts, is fundamental to human cognition and a critical challenge for machine learning. Object-centric (OC) representations, which encode a scene as a set of objects, are often argued to support such generalization, but systematic evidence in visually rich settings is limited. We introduce a Visual Question Answering benchmark across three controlled visual worlds (CLEVRTex, Super-CLEVR, and MOVi-C) to measure how well vision encoders, with and without object-centric biases, generalize to unseen combinations of object properties. To ensure a fair and comprehensive comparison, we carefully account for training data diversity, sample size, representation size, downstream model capacity, and compute. We use DINOv2 and SigLIP2, two widely used vision encoders, as the foundation models and their OC…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The paper is well-organized, with a logical flow from the introduction to the experimental design, making it easy for readers to follow the authors' reasoning and methodology.
2. The experimental part is relatively thorough and comprehensive.

Weaknesses

(1) Limited Novelty in Methodology: The core object-centric (OC) models proposed in this paper (DINOSAURv2, SigLIPSAUR2) are essentially derivative models built upon existing foundational models (DINOv2, SigLIP2) combined with Slot Attention (Locatello et al., 2020). No novel object decomposition mechanism or representation learning paradigm is introduced. The technical approach is essentially a combination of "existing foundational models + mature Slot Attention bottleneck," which overlaps h

Reviewer 02 · Rating 2 · Confidence 4

Strengths

The paper conducts extensive experiments analyzing common visual encoders, offering a deeper understanding of them.

Weaknesses

1. How does the compositional generalization benchmark proposed in this paper fundamentally differ from existing compositional generalization tests? To my knowledge, other benchmarks also test compositional generalization with respect to attributes, e.g., cczsl [1] and c-gqa [2].
2. The authors propose three findings. What can we do with these findings? In other words, what insights do they provide? Why are these three findings important?
3. What is the underlying mechanism by which OC represen

Reviewer 03 · Rating 2 · Confidence 2

Strengths

- Comprehensive experiments: Both in-distribution (ID) and compositional out-of-distribution (COOD) results are reported. In the COOD settings, 20% of object-property combinations are held out for testing and the rest are used for training.
- The paper is well written and easy to read.
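
The COOD protocol described in this review (hold out 20% of object-property combinations for testing) can be illustrated with a minimal sketch. It is not the authors' code: the property vocabulary, the function names, and the resampling rule that keeps every individual property value in the training set are all assumptions made for illustration.

```python
import itertools
import random


def covers_all_values(names, properties, combos):
    """Return True if every single property value appears in some combination."""
    for i, name in enumerate(names):
        seen = {combo[i] for combo in combos}
        if seen != set(properties[name]):
            return False
    return True


def split_combinations(properties, test_fraction=0.2, seed=0, max_tries=100):
    """Split property combinations into train and held-out (COOD) sets.

    Assumption for this sketch: a split is resampled until each individual
    property value still occurs in training, so that only the *combinations*
    are novel at test time.
    """
    names = sorted(properties)
    combos = list(itertools.product(*(properties[n] for n in names)))
    n_test = round(test_fraction * len(combos))
    rng = random.Random(seed)
    for _ in range(max_tries):
        rng.shuffle(combos)
        test = set(combos[:n_test])
        train = set(combos[n_test:])
        if covers_all_values(names, properties, train):
            return names, train, test
    raise ValueError("no split found that keeps all property values in training")


# Hypothetical property vocabulary, not taken from the benchmark.
PROPERTIES = {
    "shape": ["cube", "sphere", "cylinder", "cone", "torus"],
    "material": ["metal", "rubber", "wood", "brick"],
    "size": ["small", "large"],
}

names, train, test = split_combinations(PROPERTIES)
print(f"{len(train)} training combinations, {len(test)} held-out combinations")
```

With this vocabulary there are 40 combinations, so 8 are held out and 32 remain for training. A model evaluated on scenes built from the held-out set has seen every shape, material, and size during training, but never those particular pairings.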

Weaknesses

- Results are reported only for synthetic datasets. There are no tests on natural images, real-image VQA, or open-vocabulary setups, so it is unclear whether the findings transfer beyond those toy examples.
- Fig. 2 shows a strong ID-COOD correlation across settings. However, the ID–COOD correlation is somewhat expected, not novel (see [1] for example). This is an unsurprising result that many would anticipate when training distributions are simplified. It doesn’t advance understanding of why dense and OC fea

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning