Improving generalization by mimicking the human visual diet
Spandan Madan, You Li, Mengmi Zhang, Hanspeter Pfister, Gabriel Kreiman

TL;DR
This paper proposes mimicking the human visual diet by training models on diverse, context-rich data to improve their ability to generalize across real-world visual transformations and from synthetic to natural images.
Contribution
It introduces a new dataset and a transformer model designed to emulate the human visual diet, significantly enhancing generalization in computer vision tasks.
Findings
Models trained with the human visual diet outperform specialized architectures on natural images.
Incorporating scene context and transformations improves robustness to lighting, viewpoint, and material changes.
The approach narrows the gap between synthetic and real-world data generalization.
Abstract
We present a new perspective on bridging the generalization gap between biological and computer vision -- mimicking the human visual diet. While computer vision models rely on internet-scraped datasets, humans learn from limited 3D scenes under diverse real-world transformations with objects in natural context. Our results demonstrate that incorporating variations and contextual cues ubiquitous in the human visual training data (visual diet) significantly improves generalization to real-world transformations such as lighting, viewpoint, and material changes. This improvement also extends to generalizing from synthetic to real-world data -- all models trained with a human-like visual diet outperform specialized architectures by large margins when tested on natural image data. These experiments are enabled by our two key contributions: a novel dataset capturing scene context and diverse…
Taxonomy
Topics: Cell Image Analysis Techniques · Human Pose and Action Recognition · Advanced Vision and Imaging
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Residual Connection · Dense Connections · Layer Normalization · Vision Transformer
