TPCap: Unlocking Zero-Shot Image Captioning with Trigger-Augmented and Multi-Modal Purification Modules

Ruoyu Zhang; Lulu Wang; Yi He; Tongling Pan; Zhengtao Yu; Yingna Li

arXiv:2502.11024·cs.CV·February 18, 2025

TL;DR

TPCap introduces a zero-shot image captioning framework that leverages trigger-augmented generation and multi-modal purification to improve caption accuracy without external retrieval, achieving competitive results efficiently.

Contribution

It presents a novel trigger-augmented and multi-modal purification approach that enhances zero-shot image captioning without relying on external retrieval systems.

Findings

01. Achieves competitive performance on multiple datasets.

02. Uses only 0.82M trainable parameters and one GPU.

03. Effectively improves caption quality and factual consistency.

Abstract

Recent advancements in large language models (LLMs) have significantly enhanced the fluency and logical coherence of image captioning. Retrieval-Augmented Generation (RAG) is widely adopted to incorporate external knowledge into LLMs; however, existing RAG-based methods rely on separate retrieval banks, introducing computational overhead and limiting the utilization of LLMs' inherent zero-shot capabilities. To address these limitations, we propose TPCap, a novel trigger-augmented and multi-modal purification framework for zero-shot image captioning without external retrieval libraries. TPCap consists of two key components: trigger-augmented (TA) generation and multi-modal purification (MP). The TA module employs a trigger projector with frozen and learnable projections to activate LLMs' contextual reasoning, enhance visual-textual alignment, and mitigate data bias. The MP module further…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization