TL;DR
This paper introduces T2I-PAL, a novel method that uses text-to-image generation to bridge the modality gap in multi-label image recognition, enhancing performance without requiring fully annotated images.
Contribution
The paper proposes T2I-PAL, combining text-to-image generation, class-wise heatmaps, and prompt-adapter learning for improved multi-label recognition with reduced annotation effort.
Findings
Boosts recognition accuracy by 3.47% on benchmarks
Reduces manual annotation workload significantly
Achieves state-of-the-art performance on multiple datasets
Abstract
Benefiting from image-text contrastive learning, pre-trained vision-language models such as CLIP make it possible to directly leverage texts as images (TaI) for parameter-efficient fine-tuning (PEFT). While CLIP is capable of making image features similar to the corresponding text features, the modality gap remains a nontrivial issue and limits the image recognition performance of TaI. Using multi-label image recognition (MLR) as an example, we present a novel method, called T2I-PAL, to tackle the modality gap issue when using only text captions for PEFT. The core design of T2I-PAL is to leverage pre-trained text-to-image generation models to generate photo-realistic and diverse images from text captions, thereby reducing the modality gap. To further enhance MLR, T2I-PAL incorporates a class-wise heatmap and learnable prototypes. This aggregates local similarities, making the representation of local…
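The class-wise heatmap mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the heatmap is the cosine similarity between local (patch) image features and per-class embeddings, and that local similarities are aggregated by softmax-weighted spatial pooling. The function names, the temperature value, and the random toy features are all hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def class_wise_heatmap(patch_feats, class_embeds):
    # patch_feats: (H, W, D) local image features (assumed layout).
    # class_embeds: (C, D) one embedding per class (assumed layout).
    # Returns a (C, H, W) map of cosine similarities.
    p = l2_normalize(patch_feats)
    c = l2_normalize(class_embeds)
    return np.einsum("hwd,cd->chw", p, c)

def aggregate_local_scores(heatmap, temperature=0.1):
    # Assumed aggregation: softmax over spatial positions, then a weighted
    # sum of the similarities, giving one score per class.
    num_classes = heatmap.shape[0]
    flat = heatmap.reshape(num_classes, -1)
    logits = flat / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return (weights * flat).sum(axis=1)

# Toy data: random features with class 2 "planted" at patch (3, 4).
rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(7, 7, 32))
class_embeds = rng.normal(size=(5, 32))
patch_feats[3, 4] = 10.0 * class_embeds[2]

heatmap = class_wise_heatmap(patch_feats, class_embeds)
scores = aggregate_local_scores(heatmap)
```

In this toy setup the heatmap for class 2 peaks at the planted patch, and the pooled score for class 2 is the largest, which is the behavior a class-wise heatmap is meant to provide for multi-label recognition: each class can respond to a different image region.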
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
The main idea and technical details are clearly presented.
The originality and technical contribution of this work are quite limited. Using synthetic data to enhance classification performance is not a new idea. Prompt tuning and adapter learning have been proposed or utilized in previous works (e.g., Guo et al., 2022; Zhang et al., 2022). The authors should give more in-depth analyses or insights.
I. This method does not require any original training images and does not suffer from performance degradation due to the modality gap caused by using only text captions. II. It achieved good results in experiments and is superior to other prompt-adapter learning methods.
I. How can it be ensured that the text-to-image generation model generates high-quality synthesized data? II. Categories that are not in the vocabulary seem not to have been generated, and there are domain gaps between the synthesized and real images.
- The paper has a clear and meaningful motivation (the modality gap in PEFT-based methods on MLR). The proposed method directly addresses the issue and shows good efficacy.
- The experiments are extensive and the results are competitive. The method shows consistent accuracy improvement compared with previous methods. The experiments also include a good number of ablation studies that cover multiple important aspects of the design.
- The paper is well organized and well written.
- The improvement over existing methods, especially TaI-DPT, is very small, and it is unclear whether this improvement could be attributed to other factors, e.g., is there any concern about test data leakage in Stable Diffusion's massive training data?
- The proposed design is significantly more complex than TaI-DPT, which may be undesirable, especially given the small performance improvement.
The paper appears to mostly be motivated by Guo 2022, "Texts as Images in Prompt Tuning for Multi-Label Image Recognition" with the innovation that a text-to-image generator could be employed to further use additional image features. The paper presents a useful study on the role of text-to-image generation in model training and tuning for multi-label image recognition. Experiments are performed on zero shot, few shot and partial label settings and show a modest improvement over Guo; though th
Overall, the biggest weakness is the presentation of the paper. It's unclear exactly what the steps of the method are. The paper is lacking a clear statement of its novelty. Unclear grammar usage further obscures the intent and makes the paper a difficult read. In the experiments, there are points that are unclear (see questions below), e.g., related to where exactly the captions come from, and how many images are generated, how many captions are synthesized, etc. Sec. B of the Appendix seemed
Taxonomy
Methods: Adapter · Contrastive Language-Image Pre-training · Heatmap
