Benchmarking Few-shot Transferability of Pre-trained Models with Improved Evaluation Protocols
Xu Luo, Ji Zhang, Lianli Gao, Heng Tao Shen, Jingkuan Song

TL;DR
This paper introduces FEWTRANS, a comprehensive benchmark for evaluating few-shot transfer learning, revealing that pre-trained model choice dominates performance and that simple fine-tuning often rivals complex methods, supported by mechanistic analysis.
Contribution
The paper establishes the FEWTRANS benchmark and the HPE protocol, providing a rigorous evaluation framework for few-shot transfer, and offers insights into the effectiveness of full fine-tuning.
Findings
Pre-trained model choice is the main performance factor.
Simple full fine-tuning rivals sophisticated transfer methods.
Multimodal models struggle in specialized domains due to linguistic rarity.
Abstract
Few-shot transfer has been revolutionized by stronger pre-trained models and improved adaptation algorithms. However, the field lacks a unified, rigorous evaluation protocol that is both challenging and realistic for real-world usage. In this work, we establish FEWTRANS, a comprehensive benchmark containing 10 diverse datasets, and propose the Hyperparameter Ensemble (HPE) protocol to overcome the "validation set illusion" in data-scarce regimes. Our empirical findings demonstrate that the choice of pre-trained model is the dominant factor for performance, while many sophisticated transfer methods offer negligible practical advantages over a simple full-parameter fine-tuning baseline. To explain this surprising effectiveness, we provide an in-depth mechanistic analysis showing that full fine-tuning succeeds via distributed micro-adjustments and more flexible reshaping of high-level semantic…
Peer Reviews
Decision · ICLR 2024 Conference Withdrawn Submission
Evaluating few-shot learning algorithms is a challenge. The fact that, with so many papers on the topic, there is no consensus benchmark attests to this problem. The paper identifies an important problem. Robustness with respect to hyperparameters is mostly overlooked in transfer learning. One way to take it into account is to marginalize performance over hyperparameters, which this paper aims to do. The paper is generally easy to follow and well written.
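To make the reviewer's phrase "marginalize performance over hyperparameters" concrete, here is a minimal Python sketch. It is an illustration only and is not the paper's HPE protocol: the method names, learning rates, and accuracies are hypothetical, and a uniform average over a fixed grid is assumed as the simplest form of marginalization.

```python
from statistics import mean


def summarize_over_hyperparameters(results):
    """Compare the hyperparameter-averaged score with the best-case score.

    `results` maps a method name to {hyperparameter_value: accuracy}.
    "marginal" is the uniform mean over the grid (an assumed, simple form
    of marginalization); "oracle" is the score of the best single setting,
    which is what tuning on test data would report.
    """
    summary = {}
    for method, per_config in results.items():
        accuracies = list(per_config.values())
        summary[method] = {
            "marginal": mean(accuracies),
            "oracle": max(accuracies),
        }
    return summary


# Hypothetical accuracies for two methods at two learning rates.
results = {
    "method_a": {0.001: 0.80, 0.01: 0.40},  # strong peak, fragile elsewhere
    "method_b": {0.001: 0.65, 0.01: 0.63},  # lower peak, stable
}

summary = summarize_over_hyperparameters(results)
for method, scores in summary.items():
    print(method, scores)
```

In this toy example, method_a wins on the best-case score while method_b wins once performance is averaged over the grid, which is the kind of ranking reversal that reporting only tuned results can hide.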
The authors talk about having diverse datasets, yet similar data from most of these datasets can be found in ImageNet. Real-world application of few-shot transfer learning is about out-of-distribution generalization, and it would be nice to have an axis in this benchmark showing what degree of dataset shift a given model can generalize to. To that end, it would be good to have some in-domain datasets, some near-domain datasets, and some far-domain datasets. See the paper on extreme dataset shift and th
- The need for a robust and fair benchmark for few-shot learning is imminent, and the paper addresses an important topic.
- Section 4.2 is spot on. The concept of one single held-out dataset for tuning model hyperparameters seems a very good way of organizing the benchmark and making sure that the model is tested across a variety of inference datasets without hyperparameter modifications. This is also a very good way of minimizing the leakage of data from the test dataset to the model. However,
- The contributions of the paper are not explicitly provided as a bullet-point list. Please include one in the revised version.
- In Section 4.1, the paper discusses only problematic cases in the existing few-shot benchmark literature, especially in the case of task sample size. It may create the impression that there have not been any reliable few-shot results in the literature because all existing benchmarks are compromised. I doubt that this is true and I also doubt that this is the message that
The paper has several merits.
- The motivation is good. It does a good job of analyzing the limitations of current protocols in the few-shot recognition field.
- The presented new protocol makes sense in general and addresses the aforementioned limitations.
- The experiments are sufficient, as they study representative few-shot recognition methods.
Below are several concerns related to weaknesses.
- In the Introduction, the paper explains that few-shot learning is "facilitating downstream scenarios where labeled data can be expensive or difficult to obtain". While few-shot learning is somewhat established in the community, can the authors motivate the study of few-shot learning with real-world applications? When, in practice, does an application need few-shot learning?
- The paper writes "sampling class-imbalanced tasks". As the few-shot learning setting
- This paper presents a new benchmark for few-shot transfer evaluation and conducts extensive experiments on the proposed benchmark.
- The authors identified some problems in the existing evaluations, e.g. model selection, no class imbalance, choice of hyper-parameters, etc. These insights can be useful for future research. (I have some comments on this point though; see weaknesses below.)
- The paper is overall well-written and easy to follow.
- There are a few more existing benchmarks for few-shot evaluation besides Meta-Dataset, which the authors mentioned. The authors should compare the proposed benchmark with them in terms of basic data statistics, coverage of domains, sizes, etc., to give a clearer impression of the difference, e.g., Meta-Album (Ullah et al., 2022) and Meta-Omnium (Bohdal et al., 2023). These datasets also contain various sub-datasets. The authors should better explain what is the unique advantage of the proposed FewTran
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Multimodal Machine Learning Applications
