Humans vs Vision-Language Models: A Unified Measure of Narrative Coherence

Nikolai Ilinykh; Hyewon Jang; Shalom Lappin; Asad Sayeed; Sharid Loáiciga

arXiv:2603.25537·cs.CL·March 27, 2026

Open Access

TL;DR

This paper compares human-written and vision-language model (VLM) narratives on several aspects of narrative coherence. It finds that model stories, despite fluent surface language, differ systematically from human ones in discourse organization and visual grounding.

Contribution

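It combines a set of metrics covering coreference, discourse relation types, topic continuity, character persistence, and multimodal character grounding into a narrative coherence score, and uses these measures to systematically compare human-written and model-generated stories on the Visual Writing Prompts corpus.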

Findings

01. VLMs show broadly similar coherence profiles to one another, but these profiles differ systematically from those of humans.

02. Differences on individual measures are often subtle, but they become clearer when the measures are considered jointly.

03. Model narratives differ from human stories in discourse organization and visual grounding.

Abstract

We study narrative coherence in visually grounded stories by comparing human-written narratives with those generated by vision-language models (VLMs) on the Visual Writing Prompts corpus. Using a set of metrics that capture different aspects of narrative coherence, including coreference, discourse relation types, topic continuity, character persistence, and multimodal character grounding, we compute a narrative coherence score. We find that VLMs show broadly similar coherence profiles that differ systematically from those of humans. In addition, differences for individual measures are often subtle, but they become clearer when considered jointly. Overall, our results indicate that, despite human-like surface fluency, model narratives exhibit systematic differences from those of humans in how they organise discourse across a visually grounded story. Our code is available at…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Data Visualization and Analytics