Low-Pass Filtering Improves Behavioral Alignment of Vision Models
Max Wolff, Thomas Klein, Evgenia Rusak, Felix Wichmann, Wieland Brendel

TL;DR
Applying low-pass filtering to vision models, especially through image blurring at test-time, significantly enhances their behavioral alignment with human visual responses, reducing the gap in error consistency and shape bias.
Contribution
This work demonstrates that simple low-pass filtering, such as image blurring, explains the improved behavioral alignment of generative models and sets a new state-of-the-art for model-human similarity.
Findings
Blurring images at test-time improves model-human behavioral alignment.
Optimal Gaussian filters match human contrast sensitivity functions.
Test-time low-pass filtering halves the alignment gap between DNNs and humans.
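To make the test-time intervention concrete, here is a minimal sketch of a Gaussian low-pass filter applied to an image before it reaches a classifier. It is not the authors' code: the function names `gaussian_kernel_1d` and `gaussian_blur`, the 3-sigma kernel truncation, the reflect padding, and the H x W x C float layout are all assumptions made for illustration. The filter parameters used in the paper are not stated in this summary.

```python
import numpy as np


def gaussian_kernel_1d(sigma: float) -> np.ndarray:
    """Return a normalized 1-D Gaussian kernel truncated at about 3 sigma (assumed cutoff)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    radius = max(1, int(np.ceil(3.0 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def gaussian_blur(image: np.ndarray, sigma: float) -> np.ndarray:
    """Low-pass filter an H x W x C float image with a separable Gaussian.

    Hypothetical helper for illustration; the output has the same shape as the input.
    """
    kernel = gaussian_kernel_1d(sigma)
    radius = len(kernel) // 2
    # Reflect-pad the two spatial axes so the 'valid' convolutions restore H x W.
    padded = np.pad(
        image.astype(np.float64),
        ((radius, radius), (radius, radius), (0, 0)),
        mode="reflect",
    )
    # Separable filtering: convolve along the rows axis, then along the columns axis.
    blurred = np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="valid"), 0, padded
    )
    blurred = np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="valid"), 1, blurred
    )
    return blurred
```

In such a setup the blurred array would be fed through the model's usual preprocessing in place of the original image; `sigma` controls how much high-frequency content is removed.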
Abstract
Despite their impressive performance on computer vision benchmarks, Deep Neural Networks (DNNs) still fall short of adequately modeling human visual behavior, as measured by error consistency and shape bias. Recent work hypothesized that behavioral alignment can be drastically improved through generative -- rather than discriminative -- classifiers, with far-reaching implications for models of human vision. Here, we instead show that the increased alignment of generative models can be largely explained by a seemingly innocuous resizing operation in the generative model which effectively acts as a low-pass filter. In a series of controlled experiments, we show that removing high-frequency spatial information from discriminative models like CLIP drastically increases their behavioral alignment. Simply blurring images at test-time -- rather than training on blurred images…
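The abstract measures alignment partly through error consistency. As a rough illustration, the sketch below computes error consistency in the way it is commonly defined, as Cohen's kappa over trial-by-trial correctness of two observers. That this matches the paper's exact metric is an assumption, and the function name `error_consistency` and its input format (one boolean per trial) are hypothetical.

```python
import numpy as np


def error_consistency(correct_a, correct_b) -> float:
    """Cohen's kappa on per-trial correctness of two observers (e.g. a model and a human).

    Inputs are equal-length sequences of booleans: True where the observer answered
    correctly. Returns NaN when chance agreement is 1 and kappa is undefined.
    """
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    if a.shape != b.shape:
        raise ValueError("both observers must have the same number of trials")
    # Observed agreement: fraction of trials where both are right or both are wrong.
    c_obs = np.mean(a == b)
    # Agreement expected by chance from the two accuracies alone.
    p_a, p_b = a.mean(), b.mean()
    c_exp = p_a * p_b + (1.0 - p_a) * (1.0 - p_b)
    if np.isclose(c_exp, 1.0):
        return float("nan")
    return float((c_obs - c_exp) / (1.0 - c_exp))
```

A value of 0 means the two observers agree no more often than their accuracies alone would predict, while a value of 1 means they err on exactly the same trials.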
Peer Reviews
Decision: ICLR 2026 Poster
The paper is well-organized and well-written, with smooth narrative flow from hypothesis to methods to results. I think the paper provides a simple yet powerful reinterpretation of why generative vision models appear more human-like, reframing a potentially deep theoretical question (generative vs. discriminative objectives) as a signal-processing issue (frequency content). In addition, the authors test across numerous architectures, including large-scale CLIP models, and show strong generalization
The psychophysical alignment argument can be strengthened by testing on neural benchmarks (e.g., Brain-Score, Algonauts) to verify that low-pass filtering also aligns representational geometry with biological vision. An additional control could include testing under longer exposure times or more naturalistic conditions, which could clarify whether the low-pass benefit is tied to the MvH’s 200 ms presentation constraint or generalizes across temporal regimes. While the learned Fourier filter converges
- The writing is clear, the narrative is well structured, figures effectively support the claims.
- The paper offers a simple, general, physiologically grounded intervention with immediate practical impact: prepend a low-pass filter at test time to improve human alignment without retraining.
- Reinterprets prior SOTA claims by pinning gains to an overlooked preprocessing step. The Pareto-frontier view of MvH is insightful.
- Broad cross-model evaluation; consistent trends; convergence of three
- Potential overfitting. The learned Fourier filter is optimized on MvH without a held-out set, but this makes sense in context.
- l.290: Please redefine EC and SB, because finding their meaning requires going back to a previous section.
The (apparently) simple experiments proposed here, removing high-frequency information either by down-sampling or by low-pass filtering the input images before feeding the models, actually address a *key discussion* in machine vision: is a bottom-up approach better, in which one analyzes the images and then discriminates between classes (as done in conventional discriminative classifiers), or is a top-down approach better, in which one checks whether the input is compatible with generated examples from a given
Weaknesses in this work are minor, mainly limited to (a) some notation issues, (b) mention of works that show the emergence of human-like Contrast Sensitivity in artificial nets, and (c) clarification of the discussion between using the low-pass filter at training or test time. (a) Notation in Eqs. 1-3 could be clearer. I elaborate in the "questions" box below. (b) In the Related Work section the authors mention the work of Subramanian et al. 23 on the frequency response of ANNs. Other works (e.g. L
Taxonomy
Topics: Face Recognition and Perception · Visual perception and processing mechanisms · Generative Adversarial Networks and Image Synthesis
