TPCap: Unlocking Zero-Shot Image Captioning with Trigger-Augmented and Multi-Modal Purification Modules
Ruoyu Zhang, Lulu Wang, Yi He, Tongling Pan, Zhengtao Yu, Yingna Li

TL;DR
TPCap is a zero-shot image captioning framework that combines trigger-augmented generation with multi-modal purification to improve caption accuracy without external retrieval, reaching competitive results with only 0.82M trainable parameters on a single GPU.
Contribution
The paper presents a trigger-augmented and multi-modal purification approach that improves zero-shot image captioning without relying on external retrieval systems.
Findings
Achieves competitive performance on multiple datasets.
Uses only 0.82M trainable parameters and one GPU.
Effectively improves caption quality and factual consistency.
Abstract
Recent advancements in large language models (LLMs) have significantly enhanced the fluency and logical coherence of image captioning. Retrieval-Augmented Generation (RAG) is widely adopted to incorporate external knowledge into LLMs; however, existing RAG-based methods rely on separate retrieval banks, introducing computational overhead and limiting the utilization of LLMs' inherent zero-shot capabilities. To address these limitations, we propose TPCap, a novel trigger-augmented and multi-modal purification framework for zero-shot image captioning without external retrieval libraries. TPCap consists of two key components: trigger-augmented (TA) generation and multi-modal purification (MP). The TA module employs a trigger projector with frozen and learnable projections to activate LLMs' contextual reasoning, enhance visual-textual alignment, and mitigate data bias. The MP module further…
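The abstract describes the trigger projector only as having "frozen and learnable projections". As a rough illustration of how such a split keeps the trainable parameter count small, here is a minimal pure-Python sketch. Everything in it is an assumption made for illustration: the class names `Linear` and `TriggerProjector`, the toy dimensions, and in particular the choice to sum the two branches, which the abstract does not specify. It is not the authors' implementation.

```python
import random


class Linear:
    """Toy dense layer y = W x, with weights stored as a list of rows."""

    def __init__(self, in_dim, out_dim, trainable, seed):
        rng = random.Random(seed)
        self.weight = [
            [rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)
        ]
        # Flag only; a real framework would use it to skip gradient updates.
        self.trainable = trainable

    def __call__(self, x):
        return [sum(w * v for w, v in zip(row, x)) for row in self.weight]

    def num_params(self):
        return len(self.weight) * len(self.weight[0])


class TriggerProjector:
    """Hypothetical projector: one frozen branch plus one learnable branch.

    ASSUMPTION: the branches are combined by element-wise sum. The source
    abstract does not say how the two projections interact.
    """

    def __init__(self, in_dim, out_dim):
        self.frozen = Linear(in_dim, out_dim, trainable=False, seed=0)
        self.learnable = Linear(in_dim, out_dim, trainable=True, seed=1)

    def __call__(self, x):
        return [a + b for a, b in zip(self.frozen(x), self.learnable(x))]

    def total_params(self):
        return self.frozen.num_params() + self.learnable.num_params()

    def trainable_params(self):
        branches = (self.frozen, self.learnable)
        return sum(b.num_params() for b in branches if b.trainable)


# Toy dimensions, chosen only so the counts are easy to check by hand.
projector = TriggerProjector(in_dim=4, out_dim=3)
features = [1.0, 0.5, -0.5, 2.0]  # stand-in for a visual feature vector
projected = projector(features)
```

In this toy setup only the learnable branch would receive updates, so half of the projector's weights count as trainable; the 0.82M figure reported above refers to the real model, whose layer sizes are not given in this summary.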
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
