TL;DR
This paper introduces MLCD, a multi-label cluster discrimination method that improves visual representation learning by capturing multiple semantic signals in images, outperforming previous single-label approaches.
Contribution
The paper proposes a novel multi-label cluster discrimination technique that leverages multiple pseudo-labels per image to better encode semantic structures in visual representations.
Findings
Achieves state-of-the-art results on downstream tasks
Effectively captures multi-object and attribute signals in images
Improves over single-label clustering methods
Abstract
Contrastive Language Image Pre-training (CLIP) has recently demonstrated success across various tasks due to superior feature representation empowered by image-text contrastive learning. However, the instance discrimination method used by CLIP can hardly encode the semantic structure of training data. To handle this limitation, cluster discrimination has been proposed through iterative cluster assignment and classification. Nevertheless, most cluster discrimination approaches only define a single pseudo-label for each image, neglecting multi-label signals in the image. In this paper, we propose a novel Multi-Label Cluster Discrimination method named MLCD to enhance representation learning. In the clustering step, we first cluster the large-scale LAION-400M dataset into one million centers based on off-the-shelf embedding features. Considering that natural images frequently contain…
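The abstract is cut off before it describes how multiple labels are assigned. One of the reviews below describes the idea as using the top-k closest clusters to an image. The following is a minimal sketch of that step, not the authors' implementation. It assumes cosine similarity over L2-normalized embeddings, and the function name `assign_multi_labels`, the value `k=3`, and the toy data are all hypothetical.

```python
import numpy as np

def assign_multi_labels(embeddings, centers, k=3):
    """Return, for each embedding, the indices of its k most similar centers.

    Assumption: similarity is cosine similarity; the paper's exact metric
    and choice of k are not given in the text above.
    """
    # L2-normalize rows so that a dot product equals cosine similarity.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cen = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    sims = emb @ cen.T  # shape: (n_images, n_centers)

    # argpartition finds the k best centers without sorting all of them.
    topk = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    # Order those k candidates from most to least similar.
    order = np.argsort(-np.take_along_axis(sims, topk, axis=1), axis=1)
    return np.take_along_axis(topk, order, axis=1)

# Toy data (hypothetical): 8 centers in 16 dimensions, and two "images"
# that are slightly perturbed copies of centers 2 and 5.
rng = np.random.default_rng(0)
centers = rng.normal(size=(8, 16))
images = centers[[2, 5]] + 0.01 * rng.normal(size=(2, 16))
labels = assign_multi_labels(images, centers, k=3)
```

Each row of `labels` holds the pseudo-labels for one image, nearest center first. At the scale mentioned in the abstract (one million centers), a dense similarity matrix like `sims` would not fit in memory, so a real pipeline would need batching or an approximate nearest-neighbor index.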
Peer Reviews
Decision·Submitted to ICLR 2024
Originality: This work extends the discriminative power of the CLIP model by introducing a multi-label loss to boost the semantic learning ability of the vision-language model. Quality: The improvement achieved by the proposed method is remarkable on certain datasets, and the ablation study provides comprehensive and detailed insights into its functioning. Clarity: This paper is reader-friendly and smooth. The experimental setting is quite reasonable. Significance: This paper shows the benefit of us
(1) The novelty and originality of this work are limited. It seems like the method proposed in this paper incorporates several techniques introduced in the literature. It does not offer sufficient technical inspirations for the readers to follow. (2) With respect to the limited technical novelty of this work and overall moderate improvement (I see in Table 1 and Table 2), it may not seem to be worthwhile using such huge computing resources (80 NVIDIA A100 GPUs), especially considering that visu
The algorithm proposed in this paper is intuitive and effective in learning discriminative imagery feature representations. The analysis and decomposition of triplet loss make sense to me and are potentially beneficial to a wide range of applications as a generic improvement to a widely adopted metric learning design.
My key concern about the high-level idea is whether the top-k closest clusters to an image can really reveal what objects/attributes (will use “concepts” for clarity hereafter) are involved in it. At the cluster level, samples of the same clusters are likely to share more nearest clusters (in a global picture) but the concepts involved in each independent image are almost random. Is it possible that the multiple labels assigned to the same images provide models with additional knowledge about the
1. This work considers the multi-label properties of a single image and emphasizes the learning of better semantic structure in data. 2. The designed loss function elegantly separates the loss from positive and negative classes, which enhances the parallelism and scalability during training. 3. The experiments in this work are extensive and convincing, with thorough ablation studies.
1. Clarity: This manuscript requires further refinement in terms of writing to facilitate reader comprehension, particularly by providing detailed explanations for the mathematical symbols used in the text, thus reducing reading barriers. 2. Experiments: In Section 3.2, the authors claim efficient parallel computation and scalability of the model training process. However, is there quantitative data to support this point? Does the incorporation of clustering significantly impro
Taxonomy
Methods: Contrastive Language-Image Pre-training
