Text as Any-Modality for Zero-Shot Classification by Consistent Prompt Tuning
Xiangyu Wu, Feng Yu, Yang Yang, Jianfeng Lu

TL;DR
TaAM-CPT introduces a scalable, text-based approach for zero-shot classification across unlimited modalities using prompt tuning and pre-trained models, eliminating the need for modality-specific labeled data.
Contribution
It proposes a novel framework that extends to any modality by adding prompt pools and aligned encoders, enabling multimodal generalization without labeled data.
Findings
Achieves leading results on video, image, and audio classification datasets.
Operates without modality-specific labeled data.
Supports unlimited modalities through scalable architecture.
Abstract
The integration of prompt tuning with multimodal learning has shown significant generalization abilities for various downstream tasks. Despite advancements, existing methods heavily depend on massive modality-specific labeled data (e.g., video, audio, and image), or are customized for a single modality. In this study, we present Text as Any-Modality by Consistent Prompt Tuning (TaAM-CPT), a scalable approach for constructing a general representation model toward unlimited modalities using solely text data. TaAM-CPT comprises modality prompt pools, text construction, and modality-aligned text encoders from pre-trained models, which allows for extending new modalities by simply adding prompt pools and modality-aligned text encoders. To harmonize the learning across different modalities, TaAM-CPT designs intra- and inter-modal learning objectives, which can capture category details within…
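The extension mechanism described in the abstract (one prompt pool per modality, scored against embeddings from a frozen encoder) can be pictured with a small sketch. This is an illustration under assumptions, not the authors' implementation: the `PromptPool` class, the 4-dimensional vectors, the class names, and the hand-set prompt values are all hypothetical, and the embedding passed to `classify` stands in for the output of a pre-trained encoder such as CLIP or CLAP.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class PromptPool:
    """Hypothetical pool: one learnable vector per category for one modality."""

    def __init__(self, class_names, dim, seed=0):
        rng = random.Random(seed)
        # Small random initialization; in training these would be the only
        # trainable parameters, while the encoders stay frozen.
        self.prompts = {
            name: [rng.gauss(0.0, 0.02) for _ in range(dim)]
            for name in class_names
        }

    def classify(self, embedding):
        """Return the best-scoring class and all per-class cosine scores."""
        scores = {name: cosine(embedding, vec) for name, vec in self.prompts.items()}
        return max(scores, key=scores.get), scores


# One pool per modality (assumed layout).
pools = {
    "image": PromptPool(["dog", "cat"], dim=4),
    "audio": PromptPool(["dog", "cat"], dim=4),
}
# Extending to a new modality only adds another pool.
pools["video"] = PromptPool(["dog", "cat"], dim=4)

# Hand-set prompts so the example is deterministic (stand-in for trained values).
pools["image"].prompts["dog"] = [1.0, 0.0, 0.0, 0.0]
pools["image"].prompts["cat"] = [0.0, 1.0, 0.0, 0.0]

# Stand-in for an image embedding produced by a frozen encoder.
label, scores = pools["image"].classify([0.9, 0.1, 0.0, 0.0])
print(label)  # dog
```

In this picture, supporting another modality means adding one dictionary entry plus a matching encoder, which is the scalability property the abstract claims.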
Peer Reviews
Decision·Submitted to ICLR 2025
1. This paper focuses on representation in a multimodal setting, which is an interesting and important field. It also uses two important tools: a contrastive loss for inter-modal learning and a ranking loss for intra-modal learning.
2. The training process is efficient, relying only on a prompt pool as the trainable parameters and using pre-trained models such as CLIP, CLAP, and an LLM to eliminate the need for complex data collection.
3. The experiments include both objective metrics and visual fig
The main issue lies in the novelty and practical functionality for this field:
1. The contrastive loss and ranking loss are not new; they have been applied to multimodal representation learning [1, 2] for some time.
2. The application of this paper is currently limited to relatively simple classification tasks, for which many existing tools perform well. I encourage the authors to include additional tasks, such as conditional image or audio generation using a novel class.
3. The prompt pool
- The authors present a simplistic approach to prompt building and describe the build process in detail.
- The authors propose a uni-directional contrastive loss to facilitate inter-modal training.
- TaAM-CPT effectively integrates the image/audio/video modalities and achieves competitive performance on classification tasks.
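The review excerpts do not spell out the form of the uni-directional contrastive loss. As a rough illustration only, the sketch below uses a generic one-way InfoNCE loss: each anchor is pulled toward its paired target and pushed away from the other targets, with no symmetric target-to-anchor term. The function name, the temperature value, the toy vectors, and the choice of which modality acts as the anchor are all assumptions.

```python
import math


def one_way_info_nce(anchors, targets, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to targets[i], one direction only.

    Inputs are lists of equal-length vectors, assumed to be L2-normalized.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [
            sum(a * t for a, t in zip(anchor, target)) / temperature
            for target in targets
        ]
        # Numerically stable log-sum-exp over all targets.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy unit vectors: aligned pairs should give a much lower loss than swapped pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = one_way_info_nce(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = one_way_info_nce(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < swapped_loss)  # True
```

A symmetric variant would add the same loss with the roles of `anchors` and `targets` exchanged; dropping that second term is what makes this sketch one-directional.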
- Making prompts requires inserting **{Label}**. Does this mean that different prompt pools need to be designed for different datasets?
- LLM hallucinations may affect the quality of prompt generation. Does TaAM-CPT have a quality-checking process during prompt production?
- TaAM-CPT needs to adjust inter-modal learning based on validation performance, but I noticed that the validation sets of some datasets are also used for evaluation. Is there an information leak?
- Based on the prev
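The `{Label}` question above is easier to picture with a small sketch of placeholder filling. The templates, label names, and the `build_texts` helper are hypothetical; in the paper the surrounding text is reportedly generated by an LLM, which this example replaces with two fixed strings.

```python
def build_texts(templates, labels):
    """Fill the {Label} placeholder in every template for every class name."""
    return {
        label: [template.replace("{Label}", label) for template in templates]
        for label in labels
    }


# Hypothetical templates; real ones would come from the text construction step.
templates = [
    "A short clip showing a {Label}.",
    "The sound of a {Label} in the background.",
]

texts = build_texts(templates, ["dog", "violin"])
print(texts["violin"][1])  # The sound of a violin in the background.
```

Under this reading, the templates themselves are dataset-independent and only the label list changes per dataset, though whether the prompt pools can be shared across datasets is exactly what the reviewer is asking.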
Simple approach: Compared to TaI-DPT (Guo et al., 2023) and follow-up work, the method presented in this paper is simpler, as it does not involve complex multi-grained prompts. TaAM-CPT uses only a single prompt per category per modality, which simplifies the approach.
Code availability: The authors of TaAM-CPT released the code for their implementation, which is very valuable for reproducibility.
Significance of the quantitative improvement: The inter-modal unidirectional contrastive learning is the main contribution claimed by this paper. It is ablated in Table 7 and Table 10, but without a statistical analysis of the results it is hard to evaluate the significance of the improvement. L959: "when all modalities are trained together, the performance of each modality can be further improved." This does not seem to be the case, though. Compared to independently training each modality, tra
Taxonomy
Topics: Text and Document Classification Technologies · Topic Modeling · Handwritten Text Recognition Techniques
