ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models

Danae S\'anchez Villegas; Ingo Ziegler; Desmond Elliott

arXiv:2502.19409·cs.CV·June 12, 2025

ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models

Danae S\'anchez Villegas, Ingo Ziegler, Desmond Elliott

PDF

Open Access 1 Datasets

TL;DR

ImageChain enhances multimodal large language models with sequential reasoning over image sequences by modeling visual data as multi-turn conversations, significantly improving next-scene description accuracy and zero-shot out-of-domain performance.

Contribution

The paper introduces ImageChain, a novel framework that enables multimodal models to perform temporal reasoning over image sequences through multi-turn dialogue modeling.

Findings

01

Improves next-scene description accuracy from 3.7% to 19% in SimRate.

02

Achieves robust zero-shot performance across diverse domains.

03

Validates instruction-tuning in multimodal multi-turn conversations as essential for temporal reasoning.

Abstract

Reasoning over sequences of images remains a challenge for multimodal large language models (MLLMs). While recent models incorporate multi-image data during pre-training, they still struggle to recognize sequential structures, often treating images independently. This work introduces ImageChain, a framework that enhances MLLMs with sequential reasoning capabilities over image data by modeling visual sequences as a multi-turn conversation. In ImageChain, images are interleaved with corresponding textual descriptions to form a controlled dialogue that explicitly captures temporal dependencies and narrative progression. Our method optimizes for the task of next-scene description, where the model generates a context-aware description of an upcoming scene based on preceding visual and textual cues. We demonstrate that our approach improves performance on the next-scene description task --…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Datasets

ingoziegler/StoryFrames
dataset· 924 dl
924 dl

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMultimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling