SPoRC-VIST: A Benchmark for Evaluating Generative Natural Narrative in Vision-Language Models
Yunlin Zeng

TL;DR
This paper introduces SPoRC-VIST, a benchmark and evaluation framework for assessing how well generative vision-language models create engaging, long-form visual narratives such as podcast dialogues, with an emphasis on naturalness and storytelling.
Contribution
It presents a novel pipeline for visual podcast generation, a synthetic-to-real training strategy, and a comprehensive evaluation framework using AI judges and style metrics.
Findings
The fine-tuned 32B model outperforms the 235B base model in naturalness and narrative depth.
The fine-tuned model maintains visual grounding capabilities comparable to existing models.
Synthetic training data effectively generalizes to real-world visual storytelling.
Abstract
Vision-Language Models (VLMs) have achieved remarkable success in descriptive tasks such as image captioning and visual question answering (VQA). However, their ability to generate engaging, long-form narratives -- specifically multi-speaker podcast dialogues -- remains under-explored and difficult to evaluate. Standard metrics like BLEU and ROUGE fail to capture the nuances of conversational naturalness, personality, and narrative flow, often rewarding safe, repetitive outputs over engaging storytelling. In this work, we present a novel pipeline for end-to-end visual podcast generation, and fine-tune a Qwen3-VL-32B model on a curated dataset of 4,000 image-dialogue pairs. Crucially, we use a synthetic-to-real training strategy: we train on high-quality podcast dialogues from the Structured Podcast Research Corpus (SPoRC) paired with synthetically generated imagery, and evaluate on…
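To make the abstract's point about overlap metrics concrete, here is a minimal sketch of clipped n-gram precision, the core term of BLEU (without the brevity penalty or multi-order averaging). The function names and the three toy strings are assumptions made up for illustration; they are not taken from the paper or its dataset.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """Clipped n-gram precision: the core term of BLEU, not full BLEU."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Hypothetical toy strings, invented for this sketch.
reference = "two friends hike up a mountain trail at sunrise"
flat_caption = "two friends hike up a mountain trail"
lively_turn = "okay so picture this, we are halfway up and the sky just goes pink"

flat_score = clipped_precision(flat_caption, reference)
lively_score = clipped_precision(lively_turn, reference)
print(f"flat caption: {flat_score:.3f}  lively turn: {lively_score:.3f}")
```

Under these assumptions the plain caption scores far higher than the conversational turn, even though the latter reads more like podcast speech. Surface overlap with a reference can therefore favor safe phrasing over engaging delivery.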
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling
