Multi-Level Contrastive Learning for Dense Prediction Task
Qiushan Guo, Yizhou Yu, Yi Jiang, Jiannan Wu, Zehuan Yuan, Ping Luo

TL;DR
This paper introduces Multi-Level Contrastive Learning (MCL), a self-supervised approach for dense prediction that encodes region-level features by assembling multi-scale images, outperforming state-of-the-art methods on COCO detection benchmarks.
Contribution
The paper proposes a novel multi-level contrastive loss and a montage-based pretext task that explicitly encodes position and scale, improving dense prediction pre-training efficiency and effectiveness.
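As a rough illustration of what a montage-based pretext task could look like, the sketch below packs downsampled images into a single canvas and records where each one lands. It is not the authors' implementation: the quadrant layout, the `grids` parameter, the nearest-neighbour `downsample` helper, and the function name `assemble_montage` are all assumptions made for this example. The point is only that each tile's box and grid size make its position and scale explicit.

```python
import numpy as np

def downsample(image, size):
    """Nearest-neighbour resize of a square H x W x C image to size x size."""
    idx = np.arange(size) * image.shape[0] // size
    return image[idx][:, idx]

def assemble_montage(images, grids=(1, 2, 2, 4), side=64):
    """Pack images into a side x side canvas, one tile grid per quadrant.

    Hypothetical layout: quadrant q is split into grids[q] x grids[q] tiles,
    so tiles in different quadrants have different scales. Returns the canvas
    and one (x0, y0, x1, y1, grid) record per tile.
    """
    half = side // 2
    canvas = np.zeros((side, side, images[0].shape[2]), dtype=images[0].dtype)
    boxes = []
    remaining = iter(images)
    for q, g in enumerate(grids):
        qy, qx = divmod(q, 2)          # quadrant row and column
        tile = half // g               # tile side length in this quadrant
        for r in range(g):
            for c in range(g):
                y0 = qy * half + r * tile
                x0 = qx * half + c * tile
                canvas[y0:y0 + tile, x0:x0 + tile] = downsample(next(remaining), tile)
                boxes.append((x0, y0, x0 + tile, y0 + tile, g))
    return canvas, boxes
```

With the default `grids`, the function consumes 1 + 4 + 4 + 16 = 25 input images. The returned boxes could then be used to pool one feature per sub-image (for example with RoIAlign, which the page lists under Methods).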
Findings
MCL achieves 42.5 AP on COCO with 100 epochs of pre-training.
MCL surpasses MoCo by 4.0 AP on COCO detection.
Extending the pretext task to supervised pre-training yields similar results.
Abstract
In this work, we present Multi-Level Contrastive Learning for Dense Prediction Task (MCL), an efficient self-supervised method for learning region-level feature representation for dense prediction tasks. Our method is motivated by the three key factors in detection: localization, scale consistency and recognition. To explicitly encode absolute position and scale information, we propose a novel pretext task that assembles multi-scale images in a montage manner to mimic multi-object scenarios. Unlike the existing image-level self-supervised methods, our method constructs a multi-level contrastive loss that considers each sub-region of the montage image as a singleton. Our method enables the neural network to learn regional semantic representations for translation and scale consistency while reducing pre-training epochs to the same as supervised pre-training. Extensive experiments…
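To make the idea of a contrastive loss over sub-regions more concrete, here is a minimal sketch under stated assumptions. It uses a plain InfoNCE loss in which row i of `queries` and row i of `keys` are embeddings of the same sub-image from two views, and it simply averages that loss over feature levels. The function names, the temperature value, and the averaging scheme are assumptions for illustration; the paper's actual loss (for example its choice of negatives or its use of a momentum encoder) may differ.

```python
import numpy as np

def info_nce(queries, keys, temperature=0.2):
    """Mean InfoNCE loss where queries[i] and keys[i] form the positive pair."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature                 # cosine similarities, scaled
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # positives sit on the diagonal

def multi_level_loss(levels, temperature=0.2):
    """Average InfoNCE over a list of (queries, keys) pairs, one per level.

    Assumption: each level contributes equally to the total loss.
    """
    return float(np.mean([info_nce(q, k, temperature) for q, k in levels]))
```

Treating each sub-region as its own instance means that every other sub-region in the batch, at any position in the montage, acts as a negative for it in this sketch.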
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
- Expanding the dataset by performing scaling and stitching operations at the raw data level is an interesting concept, and the interaction between different sub-images within the same input image is intriguing. - This paper has conducted extensive experiments across a wide range of downstream tasks. - This paper is well-written and logically structured.
- The idea of Montage Assembly shares similarities with MultiCrop[1], which constructs inputs of two different resolutions and performs contrastive learning on the output features. Montage Assembly, however, constructs a montage of multiple resolutions. Thus, an initial analysis should consider the effect of stacking images in the batch dimension instead of the spatial dimension, retaining structure to obtain the same features for contrastive learning, and whether this leads to a difference in e
* Previous works have indicated that multi-positive contrastive training can greatly improve the training efficiency of SSL; this paper finds a way to incorporate multi-scale information into this procedure by stitching multiple images and arranging their scales properly. * The paper is clearly written. * According to "Rethinking ImageNet Pre-training" by Kaiming He, Faster R-CNN with random initialization and a 6x schedule can achieve 41.3 box AP, and Mask R-CNN generally brings about ~1 AP
* The idea of stitching images together to improve detection has been verified in the paper "Dynamic Scale Training for Object Detection" by Yukang Chen, which is basically the same as this paper, so I doubt the novelty of MCL, although it is designed for pre-training for object detection. I think they share similar motivations. * Most previous contrastive SSL works use Faster R-CNN as the downstream baseline, not Mask R-CNN. So I think the authors should also report results on Faster R
i) The motivation is reasonable and the writing is good. ii) The paper conducts extensive experiments to verify the effectiveness of the proposed method. The experimental results seem good and outperform several related works. iii) The ablation study is sufficient and discusses the impact of the different sub-modules.
i) The idea of introducing multi-level features in contrastive learning to improve dense prediction tasks is not novel. It has been used in several previous works. ii) The authors argue that previous contrastive learning methods only consider image-level feature alignment. However, in MCL, the feature of each sub-image in the assembled montage is also an image-level feature. I think MCL does not really apply object-level feature alignment.
- The idea of composing images in a montage style is interesting. - The paper is generally well-written and easy to follow. - The downstream evaluation experiments are extensive and the results seem promising.
- **Effectiveness on non-object-centric datasets.** In the introduction section, the authors claim that object-level SSL methods use off-the-shelf algorithms to produce object proposals, which are not accurate enough on non-object-centric datasets like COCO. To tackle this issue, the authors should mainly conduct pre-training experiments on COCO to compare with previous object-level SSL methods. However, in the paper, the authors mainly pre-train MCL on the object-centric ImageNet dataset. From
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Video Surveillance and Tracking Methods · Remote-Sensing Image Classification
Methods: InfoNCE · Convolution · Region Proposal Network · Softmax · RoIAlign · Batch Normalization · Momentum Contrast · Mask R-CNN · Contrastive Learning
