ViSTA: Visual Storytelling using Multi-modal Adapters for Text-to-Image Diffusion Models

Sibo Dong; Ismail Shaheen; Maggie Shen; Rupayan Mallick; Sarah Adel Bargal

arXiv:2506.12198 · cs.CV · January 19, 2026

Open Access

TL;DR

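ViSTA introduces a multi-modal history adapter for text-to-image diffusion models to generate coherent, narrative-aligned image sequences in visual storytelling. It targets two limitations of prior work: auto-regressive methods require extensive training, and training-free subject-specific approaches lack adaptability to narrative prompts.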

Contribution

It proposes a novel multi-modal history adapter and salient history selection strategy for improved visual storytelling with diffusion models.

Findings

01

Achieves coherent and narrative-aligned image sequences.

02

Outperforms existing methods on StorySalon and FlintStonesSV datasets.

03

Adopts TIFA, a Visual Question Answering-based metric, for assessing text-image alignment in visual storytelling.

Abstract

Text-to-image diffusion models have achieved remarkable success, yet generating coherent image sequences for visual storytelling remains challenging. A key challenge is effectively leveraging all previous text-image pairs, referred to as history text-image pairs, which provide contextual information for maintaining consistency across frames. Existing auto-regressive methods condition on all past image-text pairs but require extensive training, while training-free subject-specific approaches ensure consistency but lack adaptability to narrative prompts. To address these limitations, we propose a multi-modal history adapter for text-to-image diffusion models, ViSTA. It consists of (1) a multi-modal history fusion module to extract relevant history features and (2) a history adapter to condition the generation on the extracted relevant features. We also introduce a salient history…

Taxonomy

Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications