Is Pre-training Truly Better Than Meta-Learning?
Brando Miranda, Patrick Yu, Saumya Goyal, Yu-Xiong Wang, Sanmi Koyejo

TL;DR
This paper empirically compares pre-training and meta-learning in few-shot learning, revealing that dataset diversity influences which approach performs better, challenging the belief that pre-training is always superior.
Contribution
The study provides a rigorous, fair comparison between PT and MAML using effect size and diversity metrics across diverse datasets, showing that dataset diversity determines the better approach.
Findings
PT outperforms MAML on low-diversity datasets.
MAML outperforms PT on high-diversity datasets.
The average difference in performance between PT and MAML is small: the effect size (Cohen's d) is below 0.2.
Abstract
In the context of few-shot learning, it is currently believed that a fixed pre-trained (PT) model, along with fine-tuning the final layer during evaluation, outperforms standard meta-learning algorithms. We re-evaluate these claims under an in-depth empirical examination of an extensive set of formally diverse datasets and compare PT to Model Agnostic Meta-Learning (MAML). Unlike previous work, we emphasize a fair comparison by using: the same architecture, the same optimizer, and all models trained to convergence. Crucially, we use a more rigorous statistical tool -- the effect size (Cohen's d) -- to determine the practical significance of the difference between a model trained with PT vs. a MAML. We then use a previously proposed metric -- the diversity coefficient -- to compute the average formal diversity of a dataset. Using this analysis, we demonstrate the following: 1. when the…
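To make the effect-size comparison in the abstract concrete, here is a minimal sketch of Cohen's d using the standard pooled-standard-deviation definition. The function names, the sample accuracy lists, and the use of 0.2 as a default cutoff in `is_small_effect` are illustrative assumptions; the paper's exact computation may differ.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d between two samples, using the pooled standard deviation.

    `a` and `b` are sequences of per-task scores (e.g. accuracies),
    each with at least two entries.
    """
    n_a, n_b = len(a), len(b)
    # statistics.variance is the sample variance (divides by n - 1).
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def is_small_effect(d, threshold=0.2):
    """True when |d| falls below the threshold (0.2 is the classical 'small' cutoff)."""
    return abs(d) < threshold


if __name__ == "__main__":
    # Hypothetical per-task accuracies, NOT results from the paper.
    pt_acc = [0.62, 0.58, 0.65, 0.60, 0.61]
    maml_acc = [0.61, 0.59, 0.63, 0.62, 0.60]
    d = cohens_d(pt_acc, maml_acc)
    print(f"Cohen's d = {d:.3f}, small effect: {is_small_effect(d)}")
```

A positive d here would mean the first sample (PT in the usage example) has the higher mean; the sign convention is a choice made for this sketch.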
Peer Reviews
Decision: Submitted to ICLR 2024
The paper is sufficiently well-written. Its experimental results are also quite extensive, reporting several insightful observations. There is also the development of a new statistical test, which is both novel and interesting.
Despite the above strengths, I still have a few doubts regarding the proposed evaluation scheme: 1. It would be better if the authors could elaborate more on why the p-value and confidence interval become zero, which in turn motivates the development of the new test. 2. What is the main principle behind the new test? Specifically, if it rejects a hypothesis, what can we tell about its confidence in doing so? For example, using a t-test, we are implicitly assuming the performance differences follow a st
* The problem studied has value, especially with the field moving towards PT and fine-tuning/linear evaluation over meta-learning approaches. Studying the problem in the context of task diversity is particularly interesting, and yields some intuitive results: we would possibly expect meta-learning to perform better than pre-training in the context of high diversity.
* The discussion of why Cohen's d was used is interesting -- the paper has clearly put thought into choosing appropriate statistical
* Choice of MAML model: the original MAML model has been developed significantly in recent years. It would be fairer to that class of methods to compare to one of these many developments, given they have demonstrated (in general) better performance.
* A discussion of the drawbacks of effect size -- it is useful to understand where this may be inadequate (except the fact that one must choose a threshold level). Relatedly, the paper states that standard effect sizes are 0.2, 0.5, and 0.8, but it
1. The paper uses an effect size to compare pre-training and meta-learning for the first time, which enables us to compare two methods quantitatively. 2. The paper empirically validates that the task diversity is a key property distinguishing pre-training and meta-learning, which is consistent with the motivation of meta-learning.
1. No novel methods are introduced in this paper, which in itself is OK if the results are intriguing. 2. Presentation of the results is very poor. All results are just listed in tables, and there is no attempt to present them in a comprehensible or compelling manner. Since I could not find any meaningful insight from the tables, I recommend adding more comprehensive figures, which should be contributions of the paper. 3. The results of Cohen's d can be caused by (1) the meta-training dataset, a
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification · Machine Learning in Healthcare
Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Discriminative Fine-Tuning · Linear Layer · Layer Normalization · Linear Warmup With Cosine Annealing · Attention Dropout · Dense Connections
