RORPCap: Retrieval-based Objects and Relations Prompt for Image Captioning

Jinjing Gu; Tianbao Qin; Yuanyuan Pu; Zhengpeng Zhao

arXiv:2508.07318·cs.CV·August 12, 2025

TL;DR

RORPCap introduces a retrieval-based prompt method for image captioning that leverages object and relation extraction, enabling efficient training and competitive performance without relying on detectors or GCNs.

Contribution

The paper proposes RORPCap, a novel retrieval-based prompt approach that simplifies image captioning by extracting semantic information and using prompt embeddings, reducing training time and maintaining high accuracy.

Findings

1. Achieves 120.5% CIDEr score on MS-COCO with only 2.6 hours of training.
2. Requires no object detectors or GCNs, reducing complexity and training costs.
3. Demonstrates competitive performance compared to detector-based models.

Abstract

Image captioning aims to generate natural language descriptions for input images in an open-form manner. To accurately generate descriptions related to the image, a critical step in image captioning is to identify objects and understand their relations within the image. Modern approaches typically capitalize on object detectors or combine detectors with Graph Convolutional Networks (GCNs). However, these models suffer from redundant detection information, difficulty in GCN construction, and high training costs. To address these issues, a Retrieval-based Objects and Relations Prompt for Image Captioning (RORPCap) is proposed, inspired by the fact that image-text retrieval can provide rich semantic information for input images. RORPCap employs an Objects and relations Extraction Model to extract object and relation words from the image. These words are then incorporated into predefined…
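
To make the retrieval idea concrete, the following is a minimal sketch of how candidate object and relation words might be ranked against an image by embedding similarity. It is an illustration under stated assumptions, not the paper's extraction model: the function names (`cosine`, `retrieve_words`), the tiny vocabularies, and the three-dimensional vectors are all hypothetical, and a real system would presumably obtain the embeddings from a pretrained image-text encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_words(image_vec, vocab, k):
    """Return the k vocabulary words whose embeddings are closest to the image.

    `vocab` maps each candidate word to its (hypothetical) text embedding.
    """
    ranked = sorted(vocab, key=lambda w: cosine(image_vec, vocab[w]), reverse=True)
    return ranked[:k]

# Toy, hand-written embeddings (assumption: image and words share one space).
image_vec = [0.9, 0.1, 0.4]
object_vocab = {
    "dog": [1.0, 0.0, 0.3],
    "car": [0.0, 1.0, 0.0],
    "frisbee": [0.7, 0.2, 0.6],
}
relation_vocab = {
    "catching": [0.8, 0.1, 0.5],
    "parked": [0.0, 0.9, 0.1],
}

# Retrieve object words and relation words separately, as the abstract
# distinguishes the two kinds of semantic information.
objects = retrieve_words(image_vec, object_vocab, k=2)
relations = retrieve_words(image_vec, relation_vocab, k=1)
print(objects, relations)
```

Ranking words by similarity rather than running a detector is what would let such a pipeline avoid region proposals and graph construction; how the retrieved words are then turned into a prompt is not covered by this sketch.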

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Generative Adversarial Networks and Image Synthesis