SPoRC-VIST: A Benchmark for Evaluating Generative Natural Narrative in Vision-Language Models

Yunlin Zeng

arXiv:2601.01062 · cs.LG · January 6, 2026

TL;DR

This paper introduces SPoRC-VIST, a benchmark and evaluation framework for assessing how well generative vision-language models create engaging, long-form visual narratives such as multi-speaker podcast dialogues, with an emphasis on naturalness and storytelling.

Contribution

The paper presents a novel pipeline for visual podcast generation, a synthetic-to-real training strategy, and an evaluation framework that uses AI judges and style metrics.

Findings

01. The fine-tuned 32B model outperforms a 235B base model in naturalness and narrative depth.

02. The fine-tuned model retains visual grounding capabilities comparable to existing models.

03. Training on podcast dialogues paired with synthetically generated imagery generalizes to real-world visual storytelling.

Abstract

Vision-Language Models (VLMs) have achieved remarkable success in descriptive tasks such as image captioning and visual question answering (VQA). However, their ability to generate engaging, long-form narratives -- specifically multi-speaker podcast dialogues -- remains under-explored and difficult to evaluate. Standard metrics like BLEU and ROUGE fail to capture the nuances of conversational naturalness, personality, and narrative flow, often rewarding safe, repetitive outputs over engaging storytelling. In this work, we present a novel pipeline for end-to-end visual podcast generation, and fine-tune a Qwen3-VL-32B model on a curated dataset of 4,000 image-dialogue pairs. Crucially, we use a synthetic-to-real training strategy: we train on high-quality podcast dialogues from the Structured Podcast Research Corpus (SPoRC) paired with synthetically generated imagery, and evaluate on…
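The abstract's claim that overlap metrics reward safe outputs can be illustrated with a minimal sketch. It uses a simplified clipped unigram precision (the building block of BLEU-1, not full BLEU or ROUGE), and the three sentences are invented for illustration; they are not taken from SPoRC, VIST, or the paper's results.

```python
from collections import Counter


def clipped_unigram_precision(candidate: str, reference: str) -> float:
    """Share of candidate tokens that also occur in the reference.

    Counts are clipped so a repeated candidate token cannot match more
    times than it appears in the reference. This is a simplification of
    BLEU-1 (no brevity penalty, whitespace tokenization only).
    """
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return overlap / len(cand_tokens)


# Hypothetical reference caption and two hypothetical model outputs.
reference = "two friends are sitting on a beach at sunset"
safe = "two friends are on a beach"
lively = "oh wow, look at that sky, I could stay here all evening"

safe_score = clipped_unigram_precision(safe, reference)
lively_score = clipped_unigram_precision(lively, reference)

print(f"safe:   {safe_score:.3f}")
print(f"lively: {lively_score:.3f}")
```

Under these assumptions the terse, caption-like output scores far higher than the conversational one, even though the latter is closer to how a podcast speaker might react to the image. This is the kind of mismatch that motivates judging naturalness and narrative flow separately from n-gram overlap.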

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling