ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy
Kirill Vishniakov, Zhiqiang Shen, Zhuang Liu

TL;DR
This paper compares ConvNet and Vision Transformer models trained with supervised and CLIP methods, revealing differences in behavior beyond ImageNet accuracy, such as mistake types, calibration, transferability, and invariance.
Contribution
It provides a detailed analysis of model behaviors beyond accuracy, emphasizing the importance of nuanced metrics for model selection in computer vision.
Findings
Models differ in mistake types and calibration.
Transferability varies across models and training paradigms.
Traditional metrics do not capture all performance aspects.
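Calibration, one of the properties named above, is commonly summarized by the expected calibration error (ECE): predictions are grouped into confidence bins, and the gap between mean confidence and accuracy in each bin is averaged, weighted by bin size. The sketch below is a minimal pure-Python illustration under assumed choices (10 equal-width bins, top-1 confidence as input). The function name and binning scheme are illustrative assumptions, not the paper's evaluation code.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted mean of |accuracy - mean confidence| over equal-width bins.

    confidences: top-1 predicted probability per sample, each in [0, 1].
    correct:     whether the top-1 prediction matched the label.
    n_bins:      number of equal-width bins (assumed default, not from the paper).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one sample is required")

    conf_sum = [0.0] * n_bins  # sum of confidences per bin
    hit_sum = [0] * n_bins     # number of correct predictions per bin
    count = [0] * n_bins       # number of samples per bin

    for conf, ok in zip(confidences, correct):
        # Bins are [0, 1/n_bins), ..., with confidence 1.0 placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        hit_sum[idx] += int(ok)
        count[idx] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue  # empty bins contribute nothing
        mean_conf = conf_sum[b] / count[b]
        accuracy = hit_sum[b] / count[b]
        ece += (count[b] / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Hypothetical toy data: an overconfident model (90% confident, 50% correct).
    print(expected_calibration_error([0.9] * 4, [True, True, False, False]))
```

Two models with the same accuracy can yield very different ECE values, which is why a metric of this kind can separate models that ImageNet accuracy alone does not.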
Abstract
Modern computer vision offers a great variety of models to practitioners, and selecting a model from multiple options for specific applications can be challenging. Conventionally, competing model architectures and training protocols are compared by their classification accuracy on ImageNet. However, this single metric does not fully capture performance nuances critical for specialized tasks. In this work, we conduct an in-depth comparative analysis of model behaviors beyond ImageNet accuracy, for both ConvNet and Vision Transformer architectures, each across supervised and CLIP training paradigms. Although our selected models have similar ImageNet accuracies and compute requirements, we find that they differ in many other aspects: types of mistakes, output calibration, transferability, and feature invariance, among others. This diversity in model characteristics, not captured by…
Peer Reviews
Decision · ICML 2024 Poster
This paper provides a comprehensive benchmark of ViT and ConvNeXt under both supervised and CLIP training frameworks, covering types of mistakes, synthetic data, model calibration, shape/texture bias, robustness, transferability, transformation invariance, and representation similarity.
It is unclear how the architectural differences between ViT and ConvNeXt, or the differences between the supervised and CLIP training frameworks, contribute to the performance differences in some categories. Please check the questions below.
* Using downstream benchmarks other than a single task to assess different models is a timely topic.
* The authors have explored a number of tasks.
* The paper is very easy to follow and all data is shown thoroughly.
It is difficult to understand the positioning of the paper. Several of the cited works have similar insights and have done a larger exploration. Here are a few examples:
* The finding that representations learnt by convolutional models and vision transformers are different has been thoroughly explored in Raghu et al. 2021.
* The finding that ViT models have more shape bias has been explored in Naseer et al. 2021.
* The finding that CLIP models have higher effective robustness and better linear prob
The paper is very well-written, and the experiments are detailed and thorough; they generally support the paper's conclusions. Experimental results are clearly explained. In general, I like the idea of going beyond accuracy to claim "this model is better than that one".
The paper is very empirical and lacks any kind of theoretical contribution. I don't think the paper has a lot of significance, because the main question is simple and the paper doesn't provide any explanations; it merely reports some differences. Minor nitpicks: ImageNet-R isn't referenced the first time it is mentioned. The last column in Table 1 should indicate more clearly that it is validation accuracy; the current column title is confusing.
1. This paper is well-written and easy to follow. Even though most evaluation techniques are from previous works, I do not think that proposing a new evaluation method is necessary to deliver a novel takeaway.
2. In the midst of a scarcity of analysis papers on ViTs and pre-trained neural networks, this paper is one of the few that offer such a comparative analysis. I believe there is value in this kind of analysis paper, especially when building intuition for developing new methods or sele
1. One of the primary weaknesses of this paper is its organization. While the paper offers various straightforward observations, they are merely presented side by side. This gives the impression of a technical report rather than an academic paper. Building connections between sections might enhance the paper's coherence and demonstrate insights.
2. Some findings have already been demonstrated by previous research. For instance, [1] showed that ViTs are shape-biased. As pointed out by [2], this c
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning
Methods: ConvNeXt · Vision Transformer · Contrastive Language-Image Pre-training
