Are Convolutional Neural Networks or Transformers more like human vision?

Shikhar Tuli; Ishita Dasgupta; Erin Grant; Thomas L. Griffiths

arXiv:2105.07197 · cs.CV · July 2, 2021 · 21 cites

TL;DR

This paper compares CNNs and Vision Transformers in visual recognition, showing that Transformers' errors are more human-like, which has implications for developing more human-like AI vision systems.

Contribution

It provides a behavioral analysis comparing CNNs and Vision Transformers, revealing that Transformers' error patterns align more closely with human vision.

Findings

01. Transformers' error patterns are more similar to those of humans than CNNs' error patterns are.

02. Error analysis shows Transformers relax certain inductive biases of CNNs.

03. Results suggest Transformers may be better models for human-like vision.

Abstract

Modern machine learning models for computer vision exceed humans in accuracy on specific visual recognition tasks, notably on datasets like ImageNet. However, high accuracy can be achieved in many ways. The particular decision function found by a machine learning system is determined not only by the data to which the system is exposed, but also the inductive biases of the model, which are typically harder to characterize. In this work, we follow a recent trend of in-depth behavioral analyses of neural network models that go beyond accuracy as an evaluation metric by looking at patterns of errors. Our focus is on comparing a suite of standard Convolutional Neural Networks (CNNs) and a recently-proposed attention-based network, the Vision Transformer (ViT), which relaxes the translation-invariance constraint of CNNs and therefore represents a model with a weaker set of inductive biases.…
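The abstract describes comparing models by their patterns of errors rather than by accuracy alone. The truncated text above does not name a metric, so the following is only an illustrative sketch under an assumption: it uses Cohen's kappa on per-trial correctness, one common way to ask whether two observers (say, a human and a model) make mistakes on the same images more often than their accuracies alone would predict. The function name `error_consistency` and the toy data are hypothetical and are not taken from the paper or its repository.

```python
from typing import Sequence


def error_consistency(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Cohen's kappa over per-trial correctness of two observers.

    Assumption: each sequence holds one entry per image, True if that
    observer classified the image correctly. Returns 1.0 when the two
    observers are right and wrong on exactly the same trials, and about
    0.0 when their agreement is what independent observers with the
    same accuracies would show by chance.
    """
    if len(correct_a) != len(correct_b) or len(correct_a) == 0:
        raise ValueError("need two non-empty sequences of equal length")

    n = len(correct_a)
    acc_a = sum(bool(a) for a in correct_a) / n
    acc_b = sum(bool(b) for b in correct_b) / n

    # Observed agreement: both right or both wrong on the same trial.
    observed = sum(bool(a) == bool(b) for a, b in zip(correct_a, correct_b)) / n

    # Agreement expected by chance from the two accuracies alone.
    expected = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)

    if expected == 1.0:
        # Both observers are always right or both always wrong: kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)


# Toy data (hypothetical): three observers with identical 50% accuracy.
human = [True, True, False, False]
model_x = [True, True, False, False]  # errs on the same images as the human
model_y = [True, False, True, False]  # errs on different images

print(error_consistency(human, model_x))
print(error_consistency(human, model_y))
```

All three toy observers have the same accuracy, so accuracy cannot separate the two models; only the trial-by-trial overlap of their errors with the human's does.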

Code & Models

Repositories

shikhartuli/cnn_txf_bias (JAX, official)

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dropout · Residual Connection · Dense Connections · Adam · Vision Transformer