GroundCap: A Visually Grounded Image Captioning Dataset

Daniel A. P. Oliveira; Lourenço Teodoro; David Martins de Matos

arXiv:2502.13898·cs.CV·June 27, 2025


Open Access 1 Models 1 Datasets

TL;DR

GroundCap introduces a new dataset and system for visually grounded image captioning, enabling consistent object tracking and action-object linking, with improved verification and grounding accuracy.

Contribution

We present GroundCap, a dataset of 52,016 images from 77 movies with 344 human-annotated and 52,016 automatically generated captions, and a novel ID-based grounding system that tracks object references consistently and links actions to the corresponding objects.

Findings

01

GroundCap contains 52,016 images with detailed grounding annotations.

02

Our system achieves higher grounding accuracy and produces captions that are easier to verify.

03

Human evaluation confirms improved verifiability and coherence.

Abstract

Current image captioning systems lack the ability to link descriptive text to specific visual elements, making their outputs difficult to verify. While recent approaches offer some grounding capabilities, they cannot track object identities across multiple references or ground both actions and objects simultaneously. We propose a novel ID-based grounding system that enables consistent object reference tracking and action-object linking. We present GroundCap, a dataset containing 52,016 images from 77 movies, with 344 human-annotated and 52,016 automatically generated captions. Each caption is grounded on detected objects (132 classes) and actions (51 classes) using a tag system that maintains object identity while linking actions to the corresponding objects. Our approach features persistent object IDs for reference tracking, explicit action-object linking, and the segmentation of…
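The tag system described in the abstract can be illustrated with a minimal sketch. The tag syntax below (`<obj id="...">` and `<act id="..." class="...">`) is a hypothetical stand-in chosen for readability, not the dataset's actual markup, and the helper names (`parse_mentions`, `group_by_object`, `unlinked_actions`, `strip_tags`) are likewise assumptions. The sketch shows how persistent object IDs let every mention of the same object be collected, and how an action that points at an undetected object can be flagged during verification.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical tag syntax, for illustration only (NOT the dataset's real markup):
#   <obj id="person-0">A man</obj>                 object mention with a persistent ID
#   <act id="person-0" class="walk">walks</act>    action linked to that object ID
TAG_RE = re.compile(r'<(obj|act) id="([^"]+)"(?: class="([^"]+)")?>(.*?)</\1>')


@dataclass
class Mention:
    kind: str                    # "obj" or "act"
    object_id: str               # ID of the object this mention refers to
    action_class: Optional[str]  # set only for action mentions
    text: str                    # caption text covered by the tag


def parse_mentions(caption: str) -> list[Mention]:
    """Extract every tagged span from a caption, in reading order."""
    return [
        Mention(kind, object_id, action_class or None, text)
        for kind, object_id, action_class, text in TAG_RE.findall(caption)
    ]


def group_by_object(mentions: list[Mention]) -> dict[str, list[str]]:
    """Collect all text spans (objects and actions) that share an object ID."""
    groups: dict[str, list[str]] = {}
    for m in mentions:
        groups.setdefault(m.object_id, []).append(m.text)
    return groups


def unlinked_actions(mentions: list[Mention], detected_ids: set[str]) -> list[Mention]:
    """Return action mentions whose object ID is not among the detected objects."""
    return [m for m in mentions if m.kind == "act" and m.object_id not in detected_ids]


def strip_tags(caption: str) -> str:
    """Recover the plain caption by replacing each tag with its inner text."""
    return TAG_RE.sub(lambda match: match.group(4), caption)


caption = (
    '<obj id="person-0">A man</obj> <act id="person-0" class="walk">walks</act> '
    'toward <obj id="car-0">a car</obj> while <obj id="person-0">he</obj> '
    '<act id="person-1" class="wave">waves</act>.'
)
mentions = parse_mentions(caption)
groups = group_by_object(mentions)
bad = unlinked_actions(mentions, detected_ids={"person-0", "car-0"})
plain = strip_tags(caption)
```

Here `groups["person-0"]` gathers "A man", "walks", and "he" under one identity, while `bad` contains the "waves" action because `person-1` is not in the assumed set of detected objects. A reader checking the caption against the image would only need to inspect the regions tied to each ID.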


Code & Models

Models

daniel3303/PixtralGroundCap
model · 8 dl

Datasets

daniel3303/GroundCap
dataset · 82 dl


Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization