LLM-enhanced Action-aware Multi-modal Prompt Tuning for Image-Text Matching

Mengxiao Tian; Xinxiao Wu; Shuo Yang

arXiv:2506.23502·cs.CV·July 15, 2025

TL;DR

This paper introduces an LLM-enhanced multi-modal prompt tuning approach to improve image-text matching by enabling CLIP to understand fine-grained actions, object attributes, and spatial relationships.

Contribution

The paper proposes an action-aware multi-modal prompt-tuning method that incorporates action-related external knowledge from LLMs, so that CLIP can better capture actions and the states and relationships between objects in images.

Findings

1. Significant performance improvements on benchmark datasets
2. Effective encoding of action and state information
3. Enhanced discriminative visual representations

Abstract

Driven by large-scale contrastive vision-language pre-trained models such as CLIP, recent advancements in the image-text matching task have achieved remarkable success in representation learning. Due to image-level visual-language alignment, CLIP falls short in understanding fine-grained details such as object attributes and spatial relationships between objects. Recent efforts have attempted to compel CLIP to acquire structured visual representations by introducing prompt learning to achieve object-level alignment. While achieving promising results, they still lack the capability to perceive actions, which are crucial for describing the states or relationships between objects. Therefore, we propose to endow CLIP with fine-grained action-level understanding by introducing an LLM-enhanced action-aware multi-modal prompt-tuning method, incorporating the action-related external knowledge…
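
For context, the image-level alignment mentioned in the abstract can be pictured as follows: each image and each caption is reduced to a single global embedding, and pairs are ranked by cosine similarity. The sketch below uses hand-written toy vectors and hypothetical captions; the `cosine` and `match_captions` helpers are illustrative names, not part of CLIP or of the paper's method, and no real encoder is involved.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match_captions(image_embedding, caption_embeddings):
    """Score one global image embedding against each caption embedding.

    Returns the index of the best-scoring caption and the list of scores.
    """
    scores = [cosine(image_embedding, c) for c in caption_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy embeddings (assumed values, chosen only for illustration).
image = [0.9, 0.1, 0.4]
captions = [
    [0.8, 0.2, 0.5],  # hypothetical caption: "a man riding a horse"
    [0.1, 0.9, 0.2],  # hypothetical caption: "a bowl of fruit on a table"
]

best_index, scores = match_captions(image, captions)
print(best_index, [round(s, 3) for s in scores])
```

Because the whole image collapses into one vector, two captions that differ only in the action (for example, riding versus leading a horse) may receive very similar scores, which is the limitation the abstract attributes to image-level alignment.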
