GroundCap: A Visually Grounded Image Captioning Dataset
Daniel A. P. Oliveira, Lourenço Teodoro, David Martins de Matos

TL;DR
GroundCap introduces a dataset and an ID-based grounding system for visually grounded image captioning. The system tracks object identities consistently across references and links actions to the corresponding objects, which makes captions easier to verify.
Contribution
We present GroundCap, a dataset of 52,016 images from 77 movies with 344 human-annotated and 52,016 automatically generated captions, and a novel ID-based grounding system for linking objects and actions in image captions.
Findings
GroundCap contains 52,016 images from 77 movies, with captions grounded on detected objects (132 classes) and actions (51 classes).
Our system achieves higher grounding accuracy and produces captions that are easier to verify.
Human evaluation confirms improved verifiability and coherence.
Abstract
Current image captioning systems lack the ability to link descriptive text to specific visual elements, making their outputs difficult to verify. While recent approaches offer some grounding capabilities, they cannot track object identities across multiple references or ground both actions and objects simultaneously. We propose a novel ID-based grounding system that enables consistent object reference tracking and action-object linking. We present GroundCap, a dataset containing 52,016 images from 77 movies, with 344 human-annotated and 52,016 automatically generated captions. Each caption is grounded on detected objects (132 classes) and actions (51 classes) using a tag system that maintains object identity while linking actions to the corresponding objects. Our approach features persistent object IDs for reference tracking, explicit action-object linking, and the segmentation of…
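To make the idea of persistent object IDs and action-object linking concrete, here is a minimal Python sketch. The tag names (`obj`, `act`), the `id` attribute, and the example caption are hypothetical, invented for illustration; this text does not give the dataset's actual tag syntax. The sketch only shows how a shared ID lets every mention of an object, and every action attached to it, be collected in one place.

```python
import re
from collections import defaultdict

# Hypothetical tag syntax, invented for illustration only:
#   <obj id="person-0">phrase</obj>  marks a mention of an object
#   <act id="person-0">phrase</act>  marks an action linked to that object
TAG = re.compile(r'<(obj|act) id="([^"]+)">(.*?)</\1>')


def parse_grounded_caption(caption):
    """Return (plain_text, object_mentions, action_links) for a tagged caption."""
    mentions = defaultdict(list)  # object ID -> every phrase that refers to it
    actions = defaultdict(list)   # object ID -> action phrases linked to it
    for kind, obj_id, phrase in TAG.findall(caption):
        target = mentions if kind == "obj" else actions
        target[obj_id].append(phrase)
    # Strip the tags but keep the phrases to recover the readable caption.
    plain = TAG.sub(lambda m: m.group(3), caption)
    return plain, dict(mentions), dict(actions)


caption = (
    '<obj id="person-0">A man</obj> <act id="person-0">walks</act> toward '
    '<obj id="car-0">a car</obj>; <obj id="person-0">he</obj> '
    '<act id="person-0">waves</act>.'
)
plain, mentions, actions = parse_grounded_caption(caption)
print(plain)
print(mentions)
print(actions)
```

Because "A man" and "he" carry the same ID, both mentions resolve to the same detected object, and both actions attach to it. That is the property the abstract describes as reference tracking and action-object linking.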
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
