Do computer vision foundation models learn the low-level characteristics of the human visual system?

Yancheng Cai; Fei Yin; Dounia Hammou; Rafal Mantiuk

arXiv:2502.20256 · cs.CV · March 13, 2025

TL;DR

This study tests whether computer vision foundation models reproduce low-level characteristics of the human visual system. Some models, such as DINOv2, resemble human vision, most notably in contrast masking, but clear differences remain overall.

Contribution

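The paper introduces a protocol of nine test types for comparing the low-level visual characteristics of foundation models' image encoders with those of human vision, and applies it to 45 foundation and generative models. The comparison reveals partial similarities to human vision and marked differences among models.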

Findings

01. DINOv2 shows the closest resemblance to human contrast masking.

02. Foundation models are less sensitive to low contrast than human observers.

03. Model responses to contrast vary irregularly across spatial frequencies.

Abstract

Computer vision foundation models, such as DINO or OpenCLIP, are trained in a self-supervised manner on large image datasets. Analogously, substantial evidence suggests that the human visual system (HVS) is influenced by the statistical distribution of colors and patterns in the natural world, characteristics also present in the training data of foundation models. The question we address in this paper is whether foundation models trained on natural images mimic some of the low-level characteristics of the human visual system, such as contrast detection, contrast masking, and contrast constancy. Specifically, we designed a protocol comprising nine test types to evaluate the image encoders of 45 foundation and generative models. Our results indicate that some foundation models (e.g., DINO, DINOv2, and OpenCLIP) share some of the characteristics of human vision, but other models show…
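To make the idea of a contrast detection test concrete, the following is a minimal sketch and not the authors' protocol. It assumes a Gabor patch as the stimulus, a uniform field as the reference, and Euclidean distance between encoder features as the model "response"; the names `gabor`, `toy_encoder`, and `response` are hypothetical, and `toy_encoder` is only a stand-in for a real image encoder such as DINOv2.

```python
import math


def gabor(size, cycles, contrast, mean=0.5, sigma_frac=0.15):
    """Return a size x size Gabor patch as nested lists of luminance values.

    cycles:   number of grating cycles across the image width (assumed unit).
    contrast: peak contrast relative to the mean luminance, in [0, 1].
    """
    sigma = sigma_frac * size
    c = (size - 1) / 2.0  # image centre
    img = []
    for y in range(size):
        row = []
        for x in range(size):
            envelope = math.exp(-((x - c) ** 2 + (y - c) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * cycles * (x - c) / size)
            row.append(mean * (1 + contrast * envelope * carrier))
        img.append(row)
    return img


def toy_encoder(img):
    """Hypothetical stand-in for an image encoder: [pixel mean, pixel std]."""
    flat = [v for row in img for v in row]
    m = sum(flat) / len(flat)
    var = sum((v - m) ** 2 for v in flat) / len(flat)
    return [m, math.sqrt(var)]


def response(contrast, size=32, cycles=4, encoder=toy_encoder):
    """Feature-space distance between a Gabor test image and a uniform reference.

    Euclidean distance is an assumption made for this sketch.
    """
    test = encoder(gabor(size, cycles, contrast))
    reference = encoder(gabor(size, cycles, 0.0))  # zero contrast = uniform field
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(test, reference)))


if __name__ == "__main__":
    # Sweep contrast at a fixed spatial frequency and print the model response.
    for c in (0.0, 0.01, 0.1, 0.5):
        print(c, response(c))
```

Replacing `toy_encoder` with a real model's image encoder and sweeping both `contrast` and `cycles` would give a response surface that could be compared, under these assumptions, with a human contrast sensitivity function.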

Taxonomy

Topics: Visual perception and processing mechanisms · Infrared Target Detection Methodologies

Methods: Softmax · Dense Connections · Linear Layer · Layer Normalization · Residual Connection · Attention Is All You Need · Multi-Head Attention · Vision Transformer · ALIGN · self-DIstillation with NO labels