StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation
Daniel A. P. Oliveira, David Martins de Matos

TL;DR
This paper introduces the StoryReasoning dataset and a novel approach for grounded visual storytelling that maintains character consistency and reduces hallucinations through chain-of-thought reasoning and cross-frame visual re-identification.
Contribution
The paper presents a new dataset that pairs structured scene analyses with grounded stories, along with a fine-tuned model that improves scene understanding and story coherence.
Findings
Reduced hallucinations by 12.3% on average.
Improved creativity scores by 31%.
Demonstrated effective cross-frame object re-identification.
Abstract
Visual storytelling systems struggle to maintain character identity across frames and link actions to appropriate subjects, frequently leading to referential hallucinations. These issues can be addressed through grounding of characters, objects, and other entities on the visual elements. We propose StoryReasoning, a dataset containing 4,178 stories derived from 52,016 movie images, with both structured scene analyses and grounded stories. Each story maintains character and object consistency across frames while explicitly modeling multi-frame relationships through structured tabular representations. Our approach features cross-frame object re-identification using visual similarity and face recognition, chain-of-thought reasoning for explicit narrative modeling, and a grounding scheme that links textual elements to visual entities across multiple frames. We establish baseline performance…
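The cross-frame re-identification idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each detected character or object has already been turned into an embedding vector by some visual or face model, and it uses a greedy cosine-similarity match against a gallery of known entities. The names `cosine` and `assign_ids`, the `threshold` value, and the greedy strategy are all hypothetical choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_ids(frames, threshold=0.8):
    """Assign a persistent integer ID to every detection in every frame.

    frames: list of frames; each frame is a list of embedding vectors,
            one per detected entity (assumed to be precomputed elsewhere).
    threshold: hypothetical minimum similarity for two detections to count
               as the same entity.
    Returns a list of ID lists, parallel to `frames`.
    """
    gallery = []  # one reference embedding per entity seen so far
    all_ids = []
    for detections in frames:
        used = set()  # an entity may appear at most once per frame
        frame_ids = []
        for emb in detections:
            best_id, best_sim = None, threshold
            for idx, ref in enumerate(gallery):
                if idx in used:
                    continue
                sim = cosine(emb, ref)
                if sim >= best_sim:
                    best_id, best_sim = idx, sim
            if best_id is None:
                # No sufficiently similar known entity: register a new one.
                gallery.append(emb)
                best_id = len(gallery) - 1
            used.add(best_id)
            frame_ids.append(best_id)
        all_ids.append(frame_ids)
    return all_ids
```

With IDs of this kind, a story can refer to the same character consistently across frames, which is the property the grounding scheme depends on. A real system would likely combine several similarity signals and resolve conflicting matches more carefully than this greedy pass does.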
Taxonomy
Topics: Computational and Text Analysis Methods
