Why pre-training is beneficial for downstream classification tasks?

Xin Jiang; Xu Cheng; Zechao Li

arXiv:2410.08455 · cs.LG · October 14, 2024

Open Access · 5 Reviews

TL;DR

This paper explains why pre-training improves downstream classification tasks by analyzing knowledge transfer and learning efficiency through a game-theoretic approach, revealing that pre-training preserves useful knowledge that accelerates and enhances fine-tuning.

Contribution

It introduces a novel game-theoretic framework to quantitatively analyze the effects of pre-training on downstream tasks, highlighting the preservation and transfer of knowledge.

Findings

1. Pre-trained models retain a small but crucial amount of knowledge for downstream inference.
2. Such knowledge is difficult for models trained from scratch to acquire.
3. Pre-training guides models to learn target knowledge more directly and quickly.

Abstract

Pre-training has exhibited notable benefits to downstream tasks by boosting accuracy and speeding up convergence, but the exact reasons for these benefits remain unclear. To this end, we propose to quantitatively and explicitly explain the effects of pre-training on the downstream task from a novel game-theoretic view, which also sheds new light on the learning behavior of deep neural networks (DNNs). Specifically, we extract and quantify the knowledge encoded by the pre-trained model, and further track the changes of such knowledge during the fine-tuning process. Interestingly, we discover that only a small amount of the pre-trained model's knowledge is preserved for the inference of downstream tasks. However, such preserved knowledge is very challenging for a model trained from scratch to learn. Thus, with the help of this exclusively learned and useful knowledge, the model…
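The abstract does not spell out how "knowledge" is quantified, and the reviews only describe it as a game-theoretic interaction analysis. As a rough illustration of that general idea, the sketch below computes the Harsanyi dividend, a standard game-theoretic interaction measure, for a toy set function. Treating this as the paper's metric is an assumption; `toy_model_output`, `harsanyi_interaction`, and `subsets` are hypothetical names introduced here, and the toy function merely stands in for a network's output on an input where only some variables are left unmasked.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def harsanyi_interaction(v, coalition):
    """Harsanyi dividend of `coalition` for the set function `v`.

    I(S) = sum over T subset of S of (-1)^(|S| - |T|) * v(T)
    """
    coalition = frozenset(coalition)
    return sum(
        (-1) ** (len(coalition) - len(t)) * v(t) for t in subsets(coalition)
    )


def toy_model_output(unmasked):
    """Hypothetical stand-in for a model's scalar output on a masked input.

    Variable 0 contributes on its own; variables 1 and 2 only contribute
    when both are present (a pairwise interaction).
    """
    score = 0.0
    if 0 in unmasked:
        score += 1.0
    if {1, 2} <= unmasked:
        score += 2.0
    return score


players = [0, 1, 2]
interactions = {
    s: harsanyi_interaction(toy_model_output, s) for s in subsets(players)
}

# Print only the non-zero interactions.
for s, value in interactions.items():
    if value != 0.0:
        print(sorted(s), value)
```

In this toy case only the coalitions {0} and {1, 2} carry non-zero interaction, and the interactions over all subsets sum to the output on the full input. Under the same assumption, comparing such interaction sets before and after fine-tuning would be one way to talk about preserved versus discarded knowledge. The exhaustive enumeration is exponential in the number of variables, so it is only practical for small inputs.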

Peer Reviews

Decision · ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The study covers various datasets (CUB, Stanford Cars, CIFAR-10, etc.), models (both CNN and transformer), and different domains (vision and language), providing a complete analysis of the knowledge-based understanding of the pre-training/fine-tuning paradigm.
- This work is logically organized and covers several interesting understandings, such as why the pre-training/fine-tuning framework generally achieves better performance and faster convergence than training from scratch.
- Key takeaways are highlighte

Weaknesses

- For Section 3.2.2, it is unclear whether there is any knowledge (intersections) that can only be learned through training from scratch and whether this knowledge is beneficial for the final classification. Analyzing the ratios of preserved knowledge acquired solely from training from scratch may only partially explain why pre-training and fine-tuning achieve better performance. Demonstrating that there are limited intersections unique to training from scratch would further strengthen this analysis.
- Report

Reviewer 02 · Rating 3 · Confidence 5

Strengths

1. The paper investigates a longstanding question in the transfer learning area and applies recently proposed game-theoretic interaction analyses to track the behavior of DNNs during the fine-tuning process.
2. It proposes metrics to explicitly quantify the preserved and discarded knowledge in pre-trained models, as well as the preserved and newly learned knowledge after fine-tuning.

Weaknesses

1. Despite ambitious claims in the title and abstract, the manuscript primarily applies existing techniques to transfer learning, leading only to widely recognized conclusions without offering new theoretical insights or surprising experimental results. The manuscript falls short of the ICLR acceptance standard.
2. Given that fine-tuning commonly suffers from catastrophic forgetting and that improvements often stem from leveraging common pre-trained knowledge, the presented experimental results

Reviewer 03 · Rating 5 · Confidence 4

Strengths

1. The paper is well-written, presenting complex ideas clearly and intuitively.
2. The use of game theory and an 'interaction metric' to study pre-training effects is both novel and insightful.
3. Extensive experiments across various architectures validate the proposed methods, underscoring their effectiveness and applicability.

Weaknesses

1. Lack of clear motivation: The paper lacks clear motivation for using a game-theoretic approach, as similar conclusions about pre-training benefits have been reached through alternative methods, such as feature space analysis (e.g., Deng et al., 2023). Additionally, it does not provide a discussion comparing the advantages or unique insights offered by the game-theoretic approach over existing methods, leaving its added value unclear.
2. Lack of Practical Utility and Actionable Recommendation

Reviewer 04 · Rating 3 · Confidence 3

Strengths

This paper provides a systematic way to analyze pre-training benefits. The framework moves the field beyond just observing that pre-training works to understanding how and why it works.

Weaknesses

The paper's empirical verifications focus more on validating their hypotheses about how pre-training helps downstream tasks, rather than directly verifying if the interactions they find are meaningful. While the authors thoroughly verify their hypotheses about pre-training benefits, they rely more on theoretical justification and prior work for the validity of their interaction analysis method itself. The motivation for why this procedure should work is difficult to follow. Much of the text use

Reviewer 05 · Rating 5 · Confidence 4

Strengths

Instead of measuring learning by classification performance, the authors use the interaction between elements within an image. This approach offers a more interpretable view of the learning process.

Weaknesses

Some concepts in the paper are not clearly defined mathematically and lack theoretical or experimental support. For example, in Line 340, the authors mention that pre-trained high-order knowledge is not “discriminative” for downstream tasks. It would help to clarify this—such as by explaining if low interaction scores or incorrect predictions caused by the discarded pre-trained knowledge support this idea. The authors should prove the relationship between the proposed metric and the model perfo

Taxonomy

Topics: Neural Networks and Applications