Towards Understanding the Effect of Pretraining Label Granularity
Guan Zhe Hong, Yin Cui, Ariel Fuxman, Stanley H. Chan, Enming Luo

TL;DR
This paper investigates how the granularity of pretraining labels influences neural network generalization in image classification, demonstrating that fine-grained pretraining improves transfer learning performance by enabling the learning of detailed features.
Contribution
It provides both empirical evidence and theoretical explanation for the benefits of fine-grained pretraining labels in transfer learning, highlighting the importance of label hierarchy and alignment.
Findings
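Pretraining on the leaf labels of ImageNet21k yields better transfer results on ImageNet1k than pretraining on coarser granularity levels.
For data distributions satisfying certain hierarchy conditions, the paper proves that coarse-grained pretraining learns only the common features well, while fine-grained pretraining also captures rarer features.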
Proper label hierarchy and alignment are crucial for effective transfer.
Abstract
In this paper, we study how the granularity of pretraining labels affects the generalization of deep neural networks in image classification tasks. We focus on the "fine-to-coarse" transfer learning setting, where the pretraining label space is more fine-grained than that of the target problem. Empirically, we show that pretraining on the leaf labels of ImageNet21k produces better transfer results on ImageNet1k than pretraining on other coarser granularity levels, which supports the common practice used in the community. Theoretically, we explain the benefit of fine-grained pretraining by proving that, for a data distribution satisfying certain hierarchy conditions, 1) coarse-grained pretraining only allows a neural network to learn the "common" or "easy-to-learn" features well, while 2) fine-grained pretraining helps the network learn the "rarer" or "fine-grained" features in addition…
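To make "granularity level" concrete, here is a minimal sketch of relabeling leaf classes with their ancestors at a chosen depth of a label tree. Everything in it is an illustrative assumption rather than the paper's procedure: the toy hierarchy, the `PARENT` mapping, and the helpers `path_from_root` and `coarsen` are hypothetical names, and the way the authors derive coarser ImageNet21k label sets may differ.

```python
from typing import Dict, List

# Hypothetical toy hierarchy (not from the paper): child label -> parent label.
# The root ("entity") has no entry.
PARENT: Dict[str, str] = {
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
    "husky": "dog",
    "beagle": "dog",
    "siamese": "cat",
    "sedan": "car",
}


def path_from_root(label: str) -> List[str]:
    """Return the chain of labels from the root down to `label`."""
    path = [label]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]


def coarsen(label: str, depth: int) -> str:
    """Map a label to its ancestor at `depth` (root = 0).

    Labels that sit shallower than `depth` are returned unchanged.
    """
    path = path_from_root(label)
    return path[min(depth, len(path) - 1)]


leaf_labels = ["husky", "beagle", "siamese", "sedan"]
for depth in range(4):
    # Smaller depth = coarser label space; the deepest level keeps the leaves.
    print(depth, [coarsen(label, depth) for label in leaf_labels])
```

In this toy setup, pretraining at depth 1 would see only two classes (`animal`, `vehicle`), while pretraining at depth 3 would see all four leaf classes. That is the kind of coarse-versus-fine comparison the abstract describes.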
Peer Reviews
Decision: Submitted to ICLR 2024
* I believe the paper makes valid theoretical contributions which are partially supported experimentally.
* The paper is easy to read and follow, and the main takeaway messages are easy to understand.
I am not totally sure that the idealized setup considered here makes much sense in practice. For example, Jain et al. (2023) claim that fine-grained labels are often hard and expensive to obtain and going in the coarse --> fine-grained direction is equally valuable. Moreover, when pretraining on large-scale datasets, (e.g., Mahajan et al. 2018), I believe it is often not clear what the label hierarchy is (or if it even exists). The other concern that I have is related to the transition from the
I believe the studied direction is important for understanding the transferability of learned representations, which corresponds to the goals of ICLR. The methodology employed in the study is theoretically driven and indicates a rigorous mathematical approach to understanding the effect of label granularity on DNNs. The experimental setup is well-detailed, using widely recognized datasets such as ImageNet and iNaturalist. The results section seems to provide theoretical backing with definitions
***Clarification***: my assessment is mainly focused on the empirical evidence, not the theoretical conclusion. The empirical results are not surprising to me, as more fine-grained labels help to gain stronger transfer performance. I believe two points could be improved: - Testing on more datasets. The current results are verified on a single cross-dataset pair, which may not hold for other dataset pairs. There are some datasets are studied in low-shot learning could
1. The authors have provided both theoretical and experimental proofs, reinforcing the credibility of their arguments.
2. The drawn conclusion offers guidance for transfer learning, making the paper an engaging read.
1. Does the scale of the dataset influence the final performance? As the number of classes increases, the dataset scale typically expands. The authors may consider maintaining a consistent dataset scale (for instance, by having diverse classes with few samples each or limited classes with ample samples) to further substantiate their claims.
2. In Definition 4.2 regarding 'hard samples', this paper characterizes them based on the introduction of random noise. However, merely adding random noise doe
Taxonomy
Topics: Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: ALIGN
