HL Dataset: Visually-grounded Description of Scenes, Actions and Rationales
Michele Cafagna, Kees van Deemter, Albert Gatt

TL;DR
The HL Dataset provides a large collection of high-level captions describing the scene, action, and rationale in each image, enabling more natural and controlled vision-language research beyond object-centric descriptions.
Contribution
This paper introduces the HL Dataset with extensive high-level annotations, confidence scores, and synthetic narratives, advancing scene understanding and captioning research.
Findings
The dataset contains 14,997 images with 134,973 high-level captions.
Includes confidence scores and synthetic narrative captions.
Baseline results demonstrate the dataset's utility for high-level captioning.
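As a quick sanity check on the figures above, the short Python sketch below works out how many captions each image carries on average. The axis names are taken from the paper title. The even split of captions across the three axes is an assumption made for illustration only and is not stated in this summary.

```python
# Figures quoted in the Findings above.
NUM_IMAGES = 14_997
NUM_CAPTIONS = 134_973

# Axis names taken from the paper title (scenes, actions, rationales).
AXES = ("scene", "action", "rationale")

# Whole captions per image, plus any leftover captions.
captions_per_image, remainder = divmod(NUM_CAPTIONS, NUM_IMAGES)

# Assumption (not stated in the summary): captions are spread evenly over the axes.
captions_per_axis = captions_per_image / len(AXES)

print(captions_per_image, remainder, captions_per_axis)  # prints: 9 0 3.0
```

The totals divide exactly, giving nine captions per image, which would correspond to three captions per axis under the even-split assumption.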
Abstract
Current captioning datasets focus on object-centric captions, describing the visible objects in the image, e.g., "people eating food in a park". Although these datasets are useful to evaluate the ability of Vision & Language models to recognize and describe visual content, they do not support controlled experiments involving model testing or fine-tuning with more high-level captions, which humans find easy and natural to produce. For example, people often describe images based on the type of scene they depict ('people at a holiday resort') and the actions they perform ('people having a picnic'). Such descriptions draw on personal experience and commonsense assumptions. We present the High-Level Dataset, a dataset extending 14,997 images from the COCO dataset, aligned with a new set of 134,973 human-annotated (high-level) captions collected along three axes: scenes, actions, and…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Subtitles and Audiovisual Media
