From Image Captioning to Visual Storytelling

Admitos Passadakis; Yingjin Song; Albert Gatt

arXiv:2508.14045·cs.CL·August 21, 2025

Open Access

TL;DR

This paper presents a unified framework for visual storytelling that combines image captioning and narrative generation. It improves story quality and training efficiency, and introduces a new metric for evaluating storytelling models.

Contribution

It introduces a novel approach that treats visual storytelling as an extension of image captioning, enhancing coherence and grounding while reducing training time.

Findings

01. Integrated captioning and storytelling improves story quality.

02. The framework is faster to train and more reproducible.

03. The new ideality metric correlates with human-likeness.

Abstract

Visual Storytelling is a challenging multimodal task at the intersection of Vision & Language, where the goal is to generate a story for a stream of images. Its difficulty lies in the fact that the story should be grounded in the image sequence while also being narrative and coherent. The aim of this work is to balance these aspects by treating Visual Storytelling as a superset of Image Captioning, an approach quite different from that of most prior studies. Concretely, we first employ a vision-to-language model to obtain captions of the input images, and then transform these captions into coherent narratives using language-to-language methods. Our multifaceted evaluation shows that integrating captioning and storytelling under a unified framework has a positive impact on the quality of the produced stories. In addition, compared to numerous previous studies, this…
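The two-stage idea described in the abstract (caption each image, then turn the captions into a narrative) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `tell_story`, `toy_captioner`, and `toy_storyteller` are hypothetical, and the toy functions are simple stand-ins for the vision-to-language and language-to-language models, not the authors' implementation.

```python
from typing import Callable, Sequence

# Hypothetical type aliases: a captioner maps one image (here just an ID
# string) to a caption; a storyteller maps a list of captions to a story.
Captioner = Callable[[str], str]
Storyteller = Callable[[Sequence[str]], str]


def tell_story(image_ids: Sequence[str],
               captioner: Captioner,
               storyteller: Storyteller) -> str:
    """Stage 1: caption every image. Stage 2: narrate the captions."""
    captions = [captioner(image_id) for image_id in image_ids]
    return storyteller(captions)


def toy_captioner(image_id: str) -> str:
    """Stand-in for a vision-to-language model (a fixed lookup table)."""
    lookup = {
        "img1": "a family arrives at the beach",
        "img2": "children build a sandcastle",
        "img3": "the sun sets over the water",
    }
    return lookup.get(image_id, "an image")


def toy_storyteller(captions: Sequence[str]) -> str:
    """Stand-in for a language-to-language model (adds connectives)."""
    sentences = []
    for i, caption in enumerate(captions):
        if i == 0:
            prefix = "First,"
        elif i == len(captions) - 1:
            prefix = "Finally,"
        else:
            prefix = "Then,"
        sentences.append(f"{prefix} {caption}.")
    return " ".join(sentences)


if __name__ == "__main__":
    print(tell_story(["img1", "img2", "img3"], toy_captioner, toy_storyteller))
```

Because the two stages only communicate through plain-text captions, either stand-in could in principle be swapped for a trained model without changing `tell_story`.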

Taxonomy

Topics: Subtitles and Audiovisual Media · Multimodal Machine Learning Applications