Text as Any-Modality for Zero-Shot Classification by Consistent Prompt Tuning

Xiangyu Wu; Feng Yu; Yang Yang; Jianfeng Lu

arXiv:2508.06382·cs.CV·August 11, 2025

Open Access · 3 Reviews

TL;DR

TaAM-CPT introduces a scalable, text-based approach for zero-shot classification across unlimited modalities using prompt tuning and pre-trained models, eliminating the need for modality-specific labeled data.

Contribution

The paper proposes a framework that extends to new modalities by adding prompt pools and modality-aligned text encoders, enabling multimodal generalization without modality-specific labeled data.

Findings

01. Achieves leading results on video, image, and audio classification datasets.

02. Operates without modality-specific labeled data.

03. Supports unlimited modalities through scalable architecture.

Abstract

The integration of prompt tuning with multimodal learning has shown significant generalization abilities for various downstream tasks. Despite advancements, existing methods heavily depend on massive modality-specific labeled data (e.g., video, audio, and image), or are customized for a single modality. In this study, we present Text as Any-Modality by Consistent Prompt Tuning (TaAM-CPT), a scalable approach for constructing a general representation model toward unlimited modalities using solely text data. TaAM-CPT comprises modality prompt pools, text construction, and modality-aligned text encoders from pre-trained models, which allows for extending new modalities by simply adding prompt pools and modality-aligned text encoders. To harmonize the learning across different modalities, TaAM-CPT designs intra- and inter-modal learning objectives, which can capture category details within…
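The extension mechanism described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class `PromptPools` and its methods `add_modality` and `classify` are hypothetical names, the pools are randomly initialized rather than trained, and the input feature is assumed to come from a pre-trained encoder for the matching modality. The intra- and inter-modal training objectives are omitted. The sketch only shows that each modality owns a pool with one vector per category, that adding a modality means adding a pool, and that classification reduces to a cosine-similarity lookup.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


class PromptPools:
    """Hypothetical container: one pool per modality, one vector per category."""

    def __init__(self, categories, dim, seed=0):
        self.categories = list(categories)
        self.dim = dim
        self.rng = random.Random(seed)
        self.pools = {}

    def add_modality(self, name):
        # Extending to a new modality only adds a new pool of category vectors.
        # Random initialization stands in for prompts that would be learned.
        self.pools[name] = {
            c: [self.rng.gauss(0.0, 0.02) for _ in range(self.dim)]
            for c in self.categories
        }

    def classify(self, modality, feature):
        # `feature` is assumed to be the output of a pre-trained encoder
        # for this modality; the best-matching category prompt wins.
        scores = {c: cosine(p, feature) for c, p in self.pools[modality].items()}
        return max(scores, key=scores.get), scores


pools = PromptPools(["dog", "cat"], dim=4)
pools.add_modality("image")
pools.add_modality("audio")  # a second modality needs no change elsewhere

# Overwrite two image prompts by hand so the example is deterministic.
pools.pools["image"]["dog"] = [1.0, 0.0, 0.0, 0.0]
pools.pools["image"]["cat"] = [0.0, 1.0, 0.0, 0.0]

label, scores = pools.classify("image", [0.9, 0.1, 0.0, 0.0])
print(label)  # dog
```

In this layout the category set is shared across pools, which matches the single prompt per category per modality noted by Reviewer 03; everything else about the sketch is an illustrative assumption.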

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

1. This paper focuses on representation in a multimodal setting, which is an interesting and important field. It also uses the well-established tools of contrastive loss for inter-modal learning and ranking loss for intra-modal learning.
2. The training process is efficient, relying only on a prompt pool as the trainable parameters and using pretrained models such as CLIP, CLAP, and an LLM to eliminate the need for complex data collection.
3. The experiments include both objective metrics and visual fig

Weaknesses

The main issue lies in the novelty and practical functionality of the work for this field:
1. The contrastive loss and ranking loss are not new; they have been applied to multimodal representation learning [1, 2] for some time.
2. The application of this paper is currently limited to relatively simple classification tasks, for which many existing tools perform well. I encourage the authors to include additional tasks, such as conditional image or audio generation using the novel class.
3. The prompt pool

Reviewer 02 · Rating 5 · Confidence 4

Strengths

- The authors present a simplistic approach to prompt building and describe the build process in detail.
- The authors propose a uni-directional contrastive loss to facilitate intermodal training.
- TaAM-CPT effectively integrates the image/audio/video modalities and achieves competitive performance on classification tasks.

Weaknesses

- Making prompts requires inserting **{Label}**. Does this mean that different pools of prompts need to be designed for different datasets?
- LLM hallucinations may affect the quality of prompt generation. Does TaAM-CPT have a quality-checking process during prompt production?
- TaAM-CPT needs to adjust inter-modal learning based on validation performance, but some of the datasets' validation sets are also used for evaluation. Is there an information leak?
- Based on the prev

Reviewer 03 · Rating 5 · Confidence 3

Strengths

Simple approach: Compared to TaI-DPT (Guo et al., 2023) and follow-up work, the method presented in this paper is simpler, as it does not involve complex multi-grained prompts. TaAM-CPT uses only a single prompt per category per modality, which simplifies the approach.

Code availability: The authors of TaAM-CPT released the code for their implementation, which is very valuable for reproducibility.

Weaknesses

Significance of the quantitative improvement: The inter-modal unidirectional contrastive learning is the main contribution claimed by this paper. It is ablated in Table 7 and Table 10, but without a statistical analysis of the results, it is hard to evaluate the significance of the improvement. L959: "when all modalities are trained together, the performance of each modality can be further improved." This does not seem to be the case, though. Comparing to independently training each modality, tra

Taxonomy

Topics: Text and Document Classification Technologies · Topic Modeling · Handwritten Text Recognition Techniques