DualCap: Enhancing Lightweight Image Captioning via Dual Retrieval with Similar Scenes Visual Prompts

Binbin Li; Guimiao Yang; Zisen Qi; Haiping Wang; Yu Ding

arXiv:2510.24813·cs.CV·October 30, 2025

TL;DR

DualCap improves lightweight image captioning by generating a visual prompt from retrieved similar images, narrowing the semantic gap left by text-only prompts and capturing object and scene details with fewer trainable parameters.

Contribution

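Introduces a dual retrieval mechanism: standard image-to-text retrieval supplies text prompts, while image-to-image retrieval finds visually similar scenes whose captions yield keywords and phrases. These are encoded and fused with the original image features to form a visual prompt.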

Findings

01. Achieves competitive performance with fewer trainable parameters.

02. Effectively captures objects and scene details through visual prompts.

03. Outperforms previous visual-prompting approaches.

Abstract

Recent lightweight retrieval-augmented image caption models often utilize retrieved data solely as text prompts, thereby creating a semantic gap by leaving the original visual features unenhanced, particularly for object details or complex scenes. To address this limitation, we propose DualCap, a novel approach that enriches the visual representation by generating a visual prompt from retrieved similar images. Our model employs a dual retrieval mechanism, using standard image-to-text retrieval for text prompts and a novel image-to-image retrieval to source visually analogous scenes. Specifically, salient keywords and phrases are derived from the captions of visually similar scenes to capture key objects and similar details. These textual features are then encoded and integrated with the original image features through a lightweight, trainable feature fusion network. Extensive…
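The dual retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a toy datastore of precomputed embeddings, cosine-similarity ranking, and a simple word-frequency heuristic standing in for the paper's keyword and phrase extraction. All names (`dual_retrieve`, `salient_keywords`, `store`, `STOPWORDS`) are hypothetical. The trainable feature fusion network is not shown.

```python
import math
from collections import Counter

# Hypothetical stopword list; the paper's keyword extraction is not specified here.
STOPWORDS = {"a", "an", "the", "on", "in", "of", "with", "and", "is", "at"}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, keys, k):
    """Indices of the k keys most similar to the query, best first."""
    order = sorted(range(len(keys)), key=lambda i: cosine(query, keys[i]), reverse=True)
    return order[:k]

def salient_keywords(captions, n):
    """Assumed heuristic: the n most frequent non-stopword tokens."""
    counts = Counter(
        word for caption in captions for word in caption.lower().split()
        if word not in STOPWORDS
    )
    return [word for word, _ in counts.most_common(n)]

def dual_retrieve(query_img, store, k=2, n_keywords=3):
    """Return (text_prompt, keywords) for one query image embedding.

    Each store entry holds an image embedding ("img"), a caption
    embedding ("txt"), and the caption string ("caption").
    """
    # Branch 1: image-to-text retrieval supplies captions used as the text prompt.
    text_hits = top_k(query_img, [entry["txt"] for entry in store], k)
    text_prompt = [store[i]["caption"] for i in text_hits]
    # Branch 2: image-to-image retrieval finds similar scenes; their captions
    # yield keywords that would later be encoded and fused with image features.
    image_hits = top_k(query_img, [entry["img"] for entry in store], k)
    keywords = salient_keywords([store[i]["caption"] for i in image_hits], n_keywords)
    return text_prompt, keywords

# Toy data with made-up 3-dimensional embeddings.
store = [
    {"img": [1.0, 0.0, 0.0], "txt": [0.9, 0.1, 0.0], "caption": "a dog runs on the beach"},
    {"img": [0.9, 0.1, 0.0], "txt": [0.8, 0.2, 0.0], "caption": "a dog plays with a ball"},
    {"img": [0.0, 1.0, 0.0], "txt": [0.0, 0.9, 0.1], "caption": "a plate of pasta"},
]
query = [1.0, 0.05, 0.0]
text_prompt, keywords = dual_retrieve(query, store, k=2, n_keywords=3)
```

In this sketch the two branches share one query embedding and differ only in which stored embedding they compare against. A real system would presumably use a joint image-text encoder and an approximate nearest-neighbour index rather than a linear scan.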
