HL Dataset: Visually-grounded Description of Scenes, Actions and Rationales

Michele Cafagna; Kees van Deemter; Albert Gatt

arXiv:2302.12189·cs.CL·September 26, 2023·1 cites

TL;DR

The HL Dataset pairs images with a large collection of high-level captions describing scenes, actions, and rationales, enabling more natural and controlled vision-language research beyond object-centric descriptions.

Contribution

This paper introduces the HL Dataset with extensive high-level annotations, confidence scores, and synthetic narratives, advancing scene understanding and captioning research.

Findings

01

The dataset contains 14,997 images aligned with 134,973 high-level captions.

02

Includes confidence scores and synthetic narrative captions.

03

Baseline results demonstrate the dataset's utility for high-level captioning.
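
The reported counts can be sanity-checked with a little arithmetic. The sketch below assumes, purely for illustration, that captions are spread evenly across images and across the three axes (scene, action, rationale); the summary above does not state this, but the totals are consistent with it. The constant and variable names are hypothetical.

```python
# Counts as reported in the findings above.
NUM_IMAGES = 14_997
NUM_CAPTIONS = 134_973

# The three annotation axes (assumption: captions are split evenly across them).
AXES = ("scene", "action", "rationale")

# Captions per image, with the remainder showing whether the split is exact.
captions_per_image, image_remainder = divmod(NUM_CAPTIONS, NUM_IMAGES)

# Captions per axis for each image, under the even-split assumption.
captions_per_axis, axis_remainder = divmod(captions_per_image, len(AXES))

print(captions_per_image, image_remainder)  # 9 0
print(captions_per_axis, axis_remainder)    # 3 0
```

Both remainders are zero, so the totals work out to exactly nine captions per image, which would correspond to three per axis if the split is indeed uniform.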

Abstract

Current captioning datasets focus on object-centric captions, which describe the visible objects in an image, e.g. "people eating food in a park". Although these datasets are useful for evaluating the ability of Vision & Language models to recognize and describe visual content, they do not support controlled experiments, such as model testing or fine-tuning, with the more high-level captions that humans find easy and natural to produce. For example, people often describe images based on the type of scene they depict ('people at a holiday resort') and the actions they perform ('people having a picnic'). Such descriptions draw on personal experience and commonsense assumptions. We present the High-Level Dataset, a dataset extending 14,997 images from the COCO dataset, aligned with a new set of 134,973 human-annotated (high-level) captions collected along three axes: scenes, actions, and…

Code & Models

Repositories

michelecafagna26/hl-dataset
Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Subtitles and Audiovisual Media