Generation is Required for Data-Efficient Perception
Jack Brady, Bernhard Schölkopf, Thomas Kipf, Simon Buchholz, Wieland Brendel

TL;DR
This paper investigates whether generative models are necessary for human-like visual perception, demonstrating that generative approaches with proper inductive biases enable better compositional generalization than non-generative methods.
Contribution
The paper formalizes the inductive biases needed for compositional generalization, showing that they are feasible in generative models but not in encoder-based models, and empirically validates these findings.
Findings
Generative methods outperform non-generative ones in compositional generalization.
Non-generative methods require large-scale pretraining or supervision to generalize well.
Enforcing inductive biases in generative models enables efficient compositional generalization.
Abstract
It has been hypothesized that human-level visual perception requires a generative approach in which internal representations result from inverting a decoder. Yet today's most successful vision models are non-generative, relying on an encoder that maps images to representations without decoder inversion. This raises the question of whether generation is, in fact, necessary for machines to achieve human-level visual perception. To address this, we study whether generative and non-generative methods can achieve compositional generalization, a hallmark of human perception. Under a compositional data generating process, we formalize the inductive biases required to guarantee compositional generalization in decoder-based (generative) and encoder-based (non-generative) methods. We then show theoretically that enforcing these inductive biases on encoders is generally infeasible using…
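As a rough illustration of what "inverting a decoder" can mean in practice, the sketch below recovers a latent vector by gradient-based search on reconstruction error. It is a toy under stated assumptions, not the paper's model: the decoder `tanh(W @ z)`, the dimensions, the step size, and the names `decode` and `invert_decoder` are all hypothetical, and the zero vector stands in for an initial guess that an encoder might otherwise supply.

```python
import numpy as np

def decode(W, z):
    """Toy decoder (an assumption for illustration): latent z -> observation x."""
    return np.tanh(W @ z)

def invert_decoder(W, x, z_init, lr=0.05, steps=5000):
    """Gradient-based search for z that minimizes 0.5 * ||decode(W, z) - x||^2."""
    z = z_init.copy()
    for _ in range(steps):
        y = decode(W, z)
        # Chain rule: d/dz 0.5 * ||tanh(W z) - x||^2 = W^T [(1 - y^2) * (y - x)]
        grad = W.T @ ((1.0 - y**2) * (y - x))
        z -= lr * grad
    return z

rng = np.random.default_rng(0)
W = 0.5 * rng.standard_normal((16, 4))   # hypothetical decoder weights
z_true = 0.5 * rng.standard_normal(4)    # ground-truth latent
x = decode(W, z_true)                    # observation to be explained

z_init = np.zeros(4)                     # stand-in for an encoder's initial guess
z_hat = invert_decoder(W, x, z_init)

err_init = np.linalg.norm(decode(W, z_init) - x)
err_final = np.linalg.norm(decode(W, z_hat) - x)
print("reconstruction error before search:", err_init)
print("reconstruction error after search: ", err_final)
```

The representation here is defined by the decoder: `z_hat` is whatever latent best reconstructs `x`. The cost is a per-input optimization loop rather than a single forward pass through an encoder.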
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper is well-written and easy to follow. 2. The paper is well motivated. Generative and non-generative approaches to perception have long been discussed in the literature, including in many empirical studies, but a theoretical analysis is still missing. This work tries to bridge that gap.
1. Most of the theoretical results build on [Brady et al., 2025], so Lemma 3.1 and Theorem 3.2 seem somewhat straightforward to obtain. 2. For gradient-based search, when an encoder provides the initial guess, it is hard to determine whether part of the decoder's compositional performance comes from the encoder. For Figure 6B, is an additional encoder used? 3. For generative replay, though not explicitly requiring external data, it is essentially using a learned generative model to
1. Strong Theoretical Foundation: Provides rigorous proofs establishing the infeasibility of enforcing compositional generalization guarantees on encoders (Theorem 3.2, Lemma 3.1, A.4) and the feasibility for decoders (Section 3, Theorem A.8). 2. Compelling Empirical Validation: Uses controlled, photorealistic datasets (PUG) to demonstrate the practical limitations of non-generative models (Fig 5) and the practical benefits of generative approaches + inversion techniques (Fig 6). 3. Novel Insigh
1. Computational Cost: The generative + search/replay approach is inherently more computationally expensive (per-query optimization or large generative model) than a single forward pass through an encoder. This practical trade-off is acknowledged but not deeply analyzed. 2. Scalability to Complex Real-World Data: While PUG is controlled, experiments don't scale to the complexity of full ImageNet or real-world uncurated data. 3. Limited Exploration of Alternative Generative Setups: Primarily use
1. Modern computer vision systems are largely discriminative: they encode images for classification or embedding without explicit generative components. In contrast, generative frameworks have been championed in computational neuroscience for explaining feedback and recurrent processing in visual cortex (e.g., Mumford; Lee & Mumford; Rao & Ballard). This paper, therefore, addresses a fundamental and long-standing question: what computational benefits do generative models confer for perception?
1. All experiments are conducted on the PUG synthetic dataset. While PUG is well-designed for controlled compositional generalization studies, it lacks the semantic richness and visual complexity of natural images (e.g., occlusion, clutter, long-tail distribution shifts). It therefore remains unclear how well the claims would transfer to real-world perception tasks. Testing on real-image compositional benchmarks would strengthen the broader argument that generative mechanisms are required for pe
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face Recognition and Perception · Visual perception and processing mechanisms
