VFA: Vision Frequency Analysis of Foundation Models and Human
Mohammad-Javad Darvishi-Bayazi, Md Rifat Arefin, Jocelyn Faubert, Irina Rish

TL;DR
This paper explores how large-scale vision models can be aligned with human perception to improve robustness against distribution shifts, highlighting the impact of model size, data richness, and multimodal features.
Contribution
It introduces a comprehensive analysis of factors influencing model-human alignment and robustness, emphasizing the importance of size, semantic richness, and multimodal data.
Findings
Larger models and datasets improve alignment with human perception.
Rich semantic information enhances model robustness.
Multimodal models show better out-of-distribution performance.
Abstract
Machine learning models often struggle with distribution shifts in real-world scenarios, whereas humans exhibit robust adaptation. Models that better align with human perception may achieve higher out-of-distribution generalization. In this study, we investigate how various characteristics of large-scale computer vision models influence their alignment with human capabilities and robustness. Our findings indicate that increasing model and data size and incorporating rich semantic information and multiple modalities enhance models' alignment with human perception and their overall robustness. Our empirical analysis demonstrates a strong correlation between out-of-distribution accuracy and human alignment.
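The correlation claim at the end of the abstract can be made concrete with a small sketch. The abstract does not say which correlation statistic the authors use, so the Pearson coefficient below is an assumption, and `pearson_r` is a hypothetical helper name. It takes one out-of-distribution accuracy and one human-alignment score per model and measures how linearly the two move together.

```python
from math import sqrt
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists.

    Assumed use: xs holds per-model OOD accuracies and ys holds per-model
    human-alignment scores. The paper's actual statistic may differ.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

A value near 1 would indicate that models with higher alignment scores also tend to have higher out-of-distribution accuracy. A rank-based statistic such as Spearman's would be a reasonable alternative if the relationship is monotonic but not linear.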
Taxonomy
Topics: 3D Surveying and Cultural Heritage
Methods: ALIGN
