Does Data Scaling Lead to Visual Compositional Generalization?
Arnas Uselis, Andrea Dittadi, Seong Joon Oh

TL;DR
This paper investigates whether data scaling improves compositional generalization in vision models, finding that data diversity, not scale, is key to learning compositional structures that enable efficient generalization.
Contribution
It demonstrates that compositional generalization depends on data diversity and concept coverage, and that a linearly factored representational structure underpins efficient compositional learning.
Findings
Data diversity drives compositional generalization.
Increased combinatorial coverage induces a factored representational structure.
Pretrained models show partial evidence of this structure.
Abstract
Compositional understanding is crucial for human intelligence, yet it remains unclear whether contemporary vision models exhibit it. The dominant machine learning paradigm is built on the premise that scaling data and model sizes will improve out-of-distribution performance, including compositional generalization. We test this premise through controlled experiments that systematically vary data scale, concept diversity, and combination coverage. We find that compositional generalization is driven by data diversity, not mere data scale. Increased combinatorial coverage forces models to discover a linearly factored representational structure, where concepts decompose into additive components. We prove this structure is key to efficiency, enabling perfect generalization from few observed combinations. Evaluating pretrained models (DINO, CLIP), we find above-random yet imperfect…
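The claim that an additive, linearly factored structure allows generalization from few observed combinations can be illustrated with a toy sketch. This is not the paper's proof or experimental setup: it works directly on synthetic representation vectors rather than images, and every name in it (`fit_additive`, `true_u`, `true_v`, the "staircase" choice of observed pairs) is a hypothetical introduced for illustration. It assumes two concept factors with 5 values each, and that the representation of a combination (i, j) is exactly the sum of one component per factor value.

```python
import numpy as np

def fit_additive(observed, reps, n_a, n_b):
    """Least-squares fit of reps[k] ~= u[i_k] + v[j_k] over observed pairs (i_k, j_k)."""
    design = np.zeros((len(observed), n_a + n_b))
    for row, (i, j) in enumerate(observed):
        design[row, i] = 1.0          # indicator for the first factor's value
        design[row, n_a + j] = 1.0    # indicator for the second factor's value
    coef, *_ = np.linalg.lstsq(design, reps, rcond=None)
    return coef[:n_a], coef[n_a:]

rng = np.random.default_rng(0)
n_a, n_b, dim = 5, 5, 8

# Assumed ground truth: each combination's representation is a sum of two components.
true_u = rng.normal(size=(n_a, dim))
true_v = rng.normal(size=(n_b, dim))

# Observe only a "staircase" of 2n - 1 combinations out of n * n.
# Every value of each factor appears at least once, and the pairs are linked.
observed = [(i, i) for i in range(n_a)] + [(i, i + 1) for i in range(n_a - 1)]
reps = np.stack([true_u[i] + true_v[j] for i, j in observed])

u_hat, v_hat = fit_additive(observed, reps, n_a, n_b)

# Predict every combination that was never observed.
seen = set(observed)
unseen = [(i, j) for i in range(n_a) for j in range(n_b) if (i, j) not in seen]
pred = np.stack([u_hat[i] + v_hat[j] for i, j in unseen])
target = np.stack([true_u[i] + true_v[j] for i, j in unseen])
max_error = float(np.abs(pred - target).max())

print(f"observed {len(observed)} of {n_a * n_b} combinations")
print(f"max abs error on unseen combinations: {max_error:.2e}")
```

The individual components are only recoverable up to a shared offset (adding a constant vector to every `u` and subtracting it from every `v` leaves all sums unchanged), but the sums for unseen pairs are still pinned down, which is why the error on held-out combinations is at the level of floating-point noise in this idealized setting. Real model representations would be only approximately additive, so the same fit would leave a nonzero residual.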
Taxonomy
Topics: Geochemistry and Geologic Mapping · Topological and Geometric Data Analysis · Domain Adaptation and Few-Shot Learning
