Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning

Chun-Mei Feng; Kai Yu; Xinxing Xu; Salman Khan; Rick Siow Mong Goh; Wangmeng Zuo; Yong Liu

arXiv:2506.10575·cs.CV·June 13, 2025

4 Reviews

TL;DR

This paper introduces T2I-PAL, a novel method that uses text-to-image generation to bridge the modality gap in multi-label image recognition, enhancing performance without requiring fully annotated images.

Contribution

The paper proposes T2I-PAL, combining text-to-image generation, class-wise heatmaps, and prompt-adapter learning for improved multi-label recognition with reduced annotation effort.

Findings

01. Boosts recognition accuracy by 3.47% on benchmarks
02. Reduces manual annotation workload significantly
03. Achieves state-of-the-art performance on multiple datasets

Abstract

Benefiting from image-text contrastive learning, pre-trained vision-language models such as CLIP make it possible to directly leverage texts as images (TaI) for parameter-efficient fine-tuning (PEFT). While CLIP can make image features similar to the corresponding text features, the modality gap remains a nontrivial issue and limits the image recognition performance of TaI. Using multi-label image recognition (MLR) as an example, we present a novel method, called T2I-PAL, to tackle the modality gap issue when using only text captions for PEFT. The core design of T2I-PAL is to leverage pre-trained text-to-image generation models to generate photo-realistic and diverse images from text captions, thereby reducing the modality gap. To further enhance MLR, T2I-PAL incorporates a class-wise heatmap and learnable prototypes. This aggregates local similarities, making the representation of local…
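The abstract is cut off before it explains how the class-wise heatmap aggregates local similarities, so the following is only a minimal sketch of one plausible reading, not the authors' formulation. It assumes the heatmap is the cosine similarity between each class embedding and each local (patch) image feature, and that aggregation is a softmax-weighted sum over patches. The function names, the `temperature` parameter, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_heatmap(patch_feats, class_embs):
    """heatmap[c][p] = similarity between class c and local patch p (assumed form)."""
    return [[cosine(c, p) for p in patch_feats] for c in class_embs]


def aggregate(heatmap, temperature=0.1):
    """Softmax-weighted sum over patches, giving one score per class.

    The pooling rule and temperature are assumptions for illustration only.
    """
    scores = []
    for row in heatmap:
        top = max(row)  # subtract the max for numerical stability
        weights = [math.exp((s - top) / temperature) for s in row]
        total = sum(weights)
        scores.append(sum(w * s for w, s in zip(weights, row)) / total)
    return scores


# Toy data: three 2-D patch features and two class embeddings (hypothetical).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
classes = [[1.0, 0.0], [-1.0, 0.0]]

heatmap = class_heatmap(patches, classes)
scores = aggregate(heatmap)
```

In this toy case the first class matches one patch exactly, so its aggregated score is dominated by that patch, while the second class matches no patch and scores near zero. A low temperature makes the pooling behave like a soft maximum over regions, which suits multi-label images where each object occupies only part of the image.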

Peer Reviews

Decision·ICLR 2024 Conference Withdrawn Submission

Reviewer 01 · Rating 3 · reject, not good enough · Confidence 4

Strengths

The main idea and technical details are clearly presented.

Weaknesses

The originality and technical contribution of this work are quite limited. Using synthetic data to enhance classification performance is not a new idea. Prompt tuning and adapter learning have been proposed or utilized in previous works (e.g., (Guo et al., 2022) and (Zhang et al., 2022)). The authors should give more in-depth analyses or insights.

Reviewer 02 · Rating 6 · marginally above the acceptance threshold · Confidence 4

Strengths

I. This method does not require any original training images and does not suffer from performance degradation due to the modality gap caused by using only text captions.
II. It achieved good results in experiments and is superior to other prompt-adapter learning methods.

Weaknesses

I. How can one ensure that the text-to-image generation model generates high-quality synthesized data?
II. Categories that are not in the vocabulary seem not to have been generated, and there are domain gaps between the synthesized and real images.

Reviewer 03 · Rating 6 · marginally above the acceptance threshold · Confidence 2

Strengths

- The paper has a clear and meaningful motivation (the modality gap in PEFT-based methods on MLR). The proposed method directly addresses the issue and shows good efficacy.
- The experiments are extensive and the results are competitive. The method shows consistent accuracy improvement compared with previous methods. The experiments also include a good number of ablation studies that cover multiple important aspects of the design.
- The paper is well organized and well written.

Weaknesses

- The improvement over existing methods, especially TaI-DPT, is very small, and it is unclear whether this improvement could be attributed to other factors. For example, is there any concern about test data leakage in Stable Diffusion's massive training data?
- The proposed design is significantly more complex than TaI-DPT, which may be undesirable, especially given the small performance improvement.

Reviewer 04 · Rating 5 · marginally below the acceptance threshold · Confidence 3

Strengths

The paper appears to mostly be motivated by Guo 2022, "Texts as Images in Prompt Tuning for Multi-Label Image Recognition" with the innovation that a text-to-image generator could be employed to further use additional image features. The paper presents a useful study on the role of text-to-image generation in model training and tuning for multi-label image recognition. Experiments are performed on zero shot, few shot and partial label settings and show a modest improvement over Guo; though th

Weaknesses

Overall, the biggest weakness is the presentation of the paper. It's unclear exactly what the steps of the method are. The paper is lacking a clear statement of its novelty. Unclear grammar usage further obscures the intent and makes the paper a difficult read. In the experiments, there are points that are unclear (see questions below), e.g., related to where exactly the captions come from, and how many images are generated, how many captions are synthesized, etc. Sec. B of the Appendix seemed

Taxonomy

Methods: Adapter · Contrastive Language-Image Pre-training · Heatmap