LLM-enhanced Action-aware Multi-modal Prompt Tuning for Image-Text Matching
Mengxiao Tian, Xinxiao Wu, Shuo Yang

TL;DR
This paper introduces an LLM-enhanced multi-modal prompt tuning approach to improve image-text matching by enabling CLIP to understand fine-grained actions, object attributes, and spatial relationships.
Contribution
It proposes a novel action-aware prompt tuning method that incorporates external knowledge from LLMs to enhance CLIP's understanding of actions and relationships in images.
Findings
Significant performance improvements on benchmark datasets
Effective encoding of action and state information
Enhanced discriminative visual representations
Abstract
Driven by large-scale contrastive vision-language pre-trained models such as CLIP, recent advancements in the image-text matching task have achieved remarkable success in representation learning. Due to image-level visual-language alignment, CLIP falls short in understanding fine-grained details such as object attributes and spatial relationships between objects. Recent efforts have attempted to compel CLIP to acquire structured visual representations by introducing prompt learning to achieve object-level alignment. While achieving promising results, they still lack the capability to perceive actions, which are crucial for describing the states or relationships between objects. Therefore, we propose to endow CLIP with fine-grained action-level understanding by introducing an LLM-enhanced action-aware multi-modal prompt-tuning method, incorporating the action-related external knowledge…
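To make "image-level alignment" concrete, here is a minimal sketch of how a CLIP-style model scores an image against candidate captions: each input becomes a single global embedding, and the match is the cosine similarity between them. This is a generic illustration, not the authors' prompt-tuning method. The helper names and the toy 3-dimensional vectors are assumptions made for the example and are not CLIP outputs.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_captions(image_embedding, caption_embeddings):
    """Return caption indices sorted from best to worst match, plus the raw scores."""
    scores = [cosine_similarity(image_embedding, c) for c in caption_embeddings]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy embeddings with made-up values (assumption: a real encoder would produce these).
image = [0.9, 0.1, 0.2]
captions = [
    [0.8, 0.2, 0.1],  # hypothetical caption A, close to the image
    [0.1, 0.9, 0.3],  # hypothetical caption B, far from the image
]

order, scores = rank_captions(image, captions)
print(order, scores)
```

Because the whole image collapses into one vector, two captions that differ only in an action or a spatial relation can receive nearly identical scores, which is the limitation the abstract describes.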
