Provable Compositional Generalization for Object-Centric Learning
Thaddäus Wiedemer, Jack Brady, Alexander Panfilov, Attila Juhos, Matthias Bethge, Wieland Brendel

TL;DR
This paper provides a theoretical framework showing that under certain conditions, object-centric autoencoders can provably generalize to novel object compositions, supported by experiments on synthetic data.
Contribution
The paper uses identifiability theory to analyze object-centric autoencoders and to clarify the conditions under which compositional generalization is guaranteed.
Findings
Autoencoders with structural decoder assumptions generalize compositionally.
Encoder-decoder consistency enforces learning of object-centric representations.
Experimental validation on synthetic data supports the theoretical claims.
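The encoder-decoder consistency finding can be pictured with a small numerical sketch. The version below assumes that new latents are formed by shuffling slots across a batch, decoded, and re-encoded, with the mismatch penalized; this sampling scheme, the linear toy decoder and encoder, and all names and sizes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_SLOTS, SLOT_DIM, NUM_PIXELS = 2, 4, 32  # illustrative sizes (assumption)

# Toy additive decoder: one linear map per slot (hypothetical stand-in for a learned decoder).
slot_weights = rng.normal(size=(NUM_SLOTS, NUM_PIXELS, SLOT_DIM))

def decode(z):
    """x = sum_k W_k z_k. z: (batch, slots, dim) -> x: (batch, pixels)."""
    return np.einsum("kpd,bkd->bp", slot_weights, z)

def encode(x, enc_matrix):
    """Linear toy encoder. x: (batch, pixels) -> z: (batch, slots, dim)."""
    return (x @ enc_matrix.T).reshape(x.shape[0], NUM_SLOTS, SLOT_DIM)

def recombine_slots(z, rng):
    """Shuffle each slot independently across the batch to form new slot combinations."""
    out = np.empty_like(z)
    for k in range(z.shape[1]):
        out[:, k] = z[rng.permutation(z.shape[0]), k]
    return out

def consistency_loss(z, enc_matrix, rng):
    """Mean squared error between recombined latents and their re-encoded decodings."""
    z_new = recombine_slots(z, rng)
    z_hat = encode(decode(z_new), enc_matrix)
    return float(np.mean((z_hat - z_new) ** 2))

z = rng.normal(size=(64, NUM_SLOTS, SLOT_DIM))

# Stack the per-slot maps into one (pixels, slots*dim) matrix; its pseudo-inverse
# inverts the toy decoder exactly, so the loss should vanish up to rounding.
full_decoder = np.concatenate(list(slot_weights), axis=1)
exact_encoder = np.linalg.pinv(full_decoder)
noisy_encoder = exact_encoder + 0.1 * rng.normal(size=exact_encoder.shape)

loss_exact = consistency_loss(z, exact_encoder, rng)
loss_noisy = consistency_loss(z, noisy_encoder, rng)
print(loss_exact, loss_noisy)
```

In this toy setting, an encoder that inverts the decoder on recombined slots drives the penalty to numerical zero, while a perturbed encoder does not.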
Abstract
Learning representations that generalize to novel compositions of known concepts is crucial for bridging the gap between human and machine perception. One prominent effort is learning object-centric representations, which are widely conjectured to enable compositional generalization. Yet, it remains unclear when this conjecture will be true, as a principled theoretical or empirical understanding of compositional generalization is lacking. In this work, we investigate when compositional generalization is guaranteed for object-centric representations through the lens of identifiability theory. We show that autoencoders that satisfy structural assumptions on the decoder and enforce encoder-decoder consistency will learn object-centric representations that provably generalize compositionally. We validate our theoretical result and highlight the practical relevance of our assumptions through…
Peer Reviews
Decision: ICLR 2024 oral
This paper makes the following contributions:
- Formalizing compositional generalization as an identifiability problem
- Theoretical guarantees for in-distribution identifiability
- Showing that an additive decoder enables out-of-distribution generalization
- Introducing compositional consistency regularization
- Providing overall theoretical guarantees for compositional generalization
The work makes theoretical progress on understanding compositional generalization in object-centric representation learning.
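To illustrate what an additive decoder means in this context, here is a minimal sketch assuming the image is the sum of independent per-slot renderings. The linear per-slot "renderers", the function names, and the pixel count are hypothetical; the two slots of 16 dimensions mirror the setup mentioned in the reviews.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_SLOTS, SLOT_DIM, NUM_PIXELS = 2, 16, 64  # NUM_PIXELS is an arbitrary choice

# One toy linear renderer per slot (hypothetical stand-in for a learned slot decoder).
slot_weights = rng.normal(size=(NUM_SLOTS, NUM_PIXELS, SLOT_DIM))

def decode_slot(k, z_k):
    """Render slot k on its own."""
    return slot_weights[k] @ z_k

def additive_decode(z):
    """Additive decoder: the output is the sum of per-slot renderings.

    z has shape (NUM_SLOTS, SLOT_DIM).
    """
    return sum(decode_slot(k, z[k]) for k in range(NUM_SLOTS))

# Two "seen" scenes and a novel combination built from slot 0 of one and slot 1 of the other.
z_a = rng.normal(size=(NUM_SLOTS, SLOT_DIM))
z_b = rng.normal(size=(NUM_SLOTS, SLOT_DIM))
z_novel = np.stack([z_a[0], z_b[1]])

# Because slots never interact inside the decoder, the novel scene is fully
# determined by per-slot renderings that were each already used in a seen scene.
expected = decode_slot(0, z_a[0]) + decode_slot(1, z_b[1])
print(np.allclose(additive_decode(z_novel), expected))
```

The same structure is what the reviewers flag as a limitation: a sum of independent renderings leaves no room for interactions between objects.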
- The assumptions of compositionality and irreducibility are quite restrictive. Most real-world datasets likely violate these.
- The additive decoder limits modeling of complex object interactions and relations.
- The consistency regularization implementation requires sampling implausible object combinations. More principled schemes could improve results in complex environments.
- Experiments only validate the theory on simple synthetic datasets. Testing on more diverse and realistic data would
This paper discusses an important problem: learning compositionally generalizable object-centric representations. The paper is well-written and easy to read. The connections with related works are also interesting and inspiring. The reviewer especially appreciates the theoretical guarantees and analysis. Even though the assumptions are strong on both the functions to be approximated as well as the parameterization of learned functions, they are still aligned with the image object-centric repre
It would be great if the assumptions could be relaxed, e.g., to handle occluded objects or to handle general latent variable learning domains other than the image objects. The "contemporary" work [1] discussed most parts of this paper except for the generalizable encoder. The experimental environment is simple with two-object synthetic images. It would be more convincing to see results on multi-object real images. [1] Sébastien Lachapelle, Divyat Mahajan, Ioannis Mitliagkas, and Simon L
- The paper is very well-written.
- The theory is sound and significant for the community.
- The joint encoder-decoder framework for compositional generalization in autoencoders is quite elegant.
- The limitations of the framework and the additivity constraint on the decoder are adequately stated.
Although they support the theory, the experiments are quite limited. For instance, these are all with only two slots with 16 dimensions each. See the questions section for additional information that would be interesting to see from experimentation.
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Digital Imaging for Blood Diseases
