Saliency Suppressed, Semantics Surfaced: Visual Transformations in Neural Networks and the Brain
Gustaw Opiełka, Jessica Loke, Steven Scholte

TL;DR
This paper investigates how neural networks encode visual saliency and semantics. It finds that architectures differ in how sensitive they are to saliency and in how strongly they suppress it, and it highlights the role of natural language supervision in aligning AI with human perception.
Contribution
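The paper introduces a custom image dataset in which salient and semantic information are systematically manipulated, and it uses representational analysis to compare saliency and semantic encoding in neural networks and the brain, revealing the effects of different supervision methods.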
Findings
ResNets are more sensitive to saliency than ViTs when both are trained with object classification objectives.
Networks suppress saliency in early layers; in ResNets, CLIP supervision enhances this suppression.
Semantic encoding correlates with better alignment to human perception.
Abstract
Deep learning algorithms lack human-interpretable accounts of how they transform raw visual input into a robust semantic understanding, which impedes comparisons between different architectures, training objectives, and the human brain. In this work, we take inspiration from neuroscience and employ representational approaches to shed light on how neural networks encode information at low (visual saliency) and high (semantic similarity) levels of abstraction. Moreover, we introduce a custom image dataset where we systematically manipulate salient and semantic information. We find that ResNets are more sensitive to saliency information than ViTs, when trained with object classification objectives. We uncover that networks suppress saliency in early layers, a process enhanced by natural language supervision (CLIP) in ResNets. CLIP also enhances semantic encoding in both architectures.…
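The abstract refers to "representational approaches" without spelling out a procedure. As a rough illustration only, the sketch below shows one common form of such an analysis, representational similarity analysis: build a dissimilarity matrix from a layer's activations and rank-correlate it with a model matrix that encodes a hypothesis (for example, which images share a semantic category). This is not taken from the paper. The function names, the correlation-distance metric, the Spearman comparison, and the toy data are all assumptions made for the example.

```python
import numpy as np

def activation_rdm(acts):
    """Dissimilarity matrix (1 - Pearson r) between rows of acts.

    acts: array of shape (n_images, n_features), one activation vector per image.
    """
    return 1.0 - np.corrcoef(acts)

def average_ranks(x):
    """Ranks of a 1-D array, with tied values sharing their average rank."""
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(len(x), dtype=float)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]

def rdm_similarity(rdm_a, rdm_b):
    """Spearman correlation between the upper triangles of two RDMs."""
    rows, cols = np.triu_indices(rdm_a.shape[0], k=1)
    ranks_a = average_ranks(rdm_a[rows, cols])
    ranks_b = average_ranks(rdm_b[rows, cols])
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])

# Toy data (hypothetical): six "images" drawn from two semantic categories.
rng = np.random.default_rng(0)
labels = np.array([0, 0, 0, 1, 1, 1])
prototypes = rng.normal(size=(2, 50))
acts = prototypes[labels] + 0.1 * rng.normal(size=(6, 50))

# Model RDM: 0 if two images share a category, 1 otherwise.
semantic_rdm = (labels[:, None] != labels[None, :]).astype(float)

layer_rdm = activation_rdm(acts)
score = rdm_similarity(layer_rdm, semantic_rdm)
```

Under these assumptions, a higher `score` for a given layer would indicate that its representational geometry tracks the hypothesized structure more closely. Repeating the comparison with a saliency-based model matrix across layers would be one way to trace where saliency information is suppressed.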
Taxonomy
Topics: Aesthetic Perception and Analysis
Methods: Contrastive Language-Image Pre-training
