Are Convolutional Neural Networks or Transformers more like human vision?
Shikhar Tuli, Ishita Dasgupta, Erin Grant, Thomas L. Griffiths

TL;DR
This paper compares CNNs and Vision Transformers on visual recognition tasks and finds that the Transformers' errors are more human-like, a result that bears on building more human-like AI vision systems.
Contribution
The paper provides a behavioral analysis of a suite of standard CNNs and the Vision Transformer (ViT) that goes beyond accuracy by comparing patterns of errors, and finds that ViT's error patterns align more closely with those of humans.
Findings
Vision Transformers' error patterns are more similar to those of humans than CNNs' error patterns are.
Vision Transformers relax the translation-invariance constraint of CNNs and therefore carry a weaker set of inductive biases.
Results suggest Transformers may be better models for human-like vision.
Abstract
Modern machine learning models for computer vision exceed humans in accuracy on specific visual recognition tasks, notably on datasets like ImageNet. However, high accuracy can be achieved in many ways. The particular decision function found by a machine learning system is determined not only by the data to which the system is exposed, but also the inductive biases of the model, which are typically harder to characterize. In this work, we follow a recent trend of in-depth behavioral analyses of neural network models that go beyond accuracy as an evaluation metric by looking at patterns of errors. Our focus is on comparing a suite of standard Convolutional Neural Networks (CNNs) and a recently-proposed attention-based network, the Vision Transformer (ViT), which relaxes the translation-invariance constraint of CNNs and therefore represents a model with a weaker set of inductive biases.…
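The idea of evaluating models by their patterns of errors rather than by accuracy alone can be illustrated with a small sketch. The truncated abstract above does not say which measure is used, so the choice here is an assumption: error consistency, computed as Cohen's kappa over per-trial correctness, is one common option. The function name and the toy data are hypothetical.

```python
from typing import Sequence


def error_consistency(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Cohen's kappa over per-trial correctness of two observers.

    Each input marks, trial by trial, whether that observer (a model or a
    human) classified the stimulus correctly. Kappa is 1 when both are right
    and wrong on exactly the same trials, and about 0 when the overlap is
    what their accuracies alone would produce by chance.
    """
    n = len(correct_a)
    if n == 0 or n != len(correct_b):
        raise ValueError("need two non-empty sequences of equal length")

    acc_a = sum(correct_a) / n
    acc_b = sum(correct_b) / n

    # Fraction of trials on which both are right or both are wrong.
    observed = sum(a == b for a, b in zip(correct_a, correct_b)) / n
    # Agreement expected from two independent observers with these accuracies.
    expected = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)

    if expected == 1.0:
        # Both always right or both always wrong: kappa is undefined.
        # Returning 0.0 here is a convention chosen for this sketch.
        return 0.0
    return (observed - expected) / (1 - expected)


# Toy data (hypothetical): three observers with the same accuracy (5/8).
human = [True, True, False, False, True, True, False, True]
model_same_errors = list(human)
model_other_errors = [False, True, True, True, False, True, True, False]

kappa_same = error_consistency(human, model_same_errors)
kappa_other = error_consistency(human, model_other_errors)
```

Both toy models match the human's accuracy, yet only the first makes its mistakes on the same trials, so its kappa is higher. That gap is the kind of difference an accuracy-only comparison cannot see.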
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dropout · Residual Connection · Dense Connections · Adam · Vision Transformer
