RoViST: Learning Robust Metrics for Visual Storytelling
Eileen Wang, Caren Han, Josiah Poon

TL;DR
This paper introduces three new evaluation metrics for visual storytelling that assess visual grounding, coherence, and non-redundancy. They align better with human judgment than traditional n-gram-based metrics.
Contribution
It proposes a novel set of learning-based evaluation metrics for visual storytelling that outperform existing metrics in correlating with human judgments.
Findings
The proposed metrics correlate better with human judgment than existing metrics
The metrics assess visual grounding, coherence, and non-redundancy
Applicable to models trained on the VIST dataset
Abstract
Visual storytelling (VST) is the task of generating a story paragraph that describes a given image sequence. Most existing storytelling approaches have evaluated their models using traditional natural language generation metrics like BLEU or CIDEr. However, such metrics, which are based on n-gram matching, tend to correlate poorly with human evaluation scores and do not explicitly consider other criteria necessary for storytelling, such as sentence structure or topic coherence. Moreover, a single score is not enough to assess a story, as it does not tell us what specific errors the model made. In this paper, we propose 3 evaluation metric sets that analyse the aspects we would look for in a good story: 1) visual grounding, 2) coherence, and 3) non-redundancy. We measure the reliability of our metric sets by analysing their correlation with human judgement scores on a sample of…
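The abstract describes checking metric reliability by correlating metric scores with human judgement scores. A minimal sketch of that kind of check follows, assuming Spearman rank correlation as the statistic (the truncated abstract does not say which coefficient is used). The function names and the toy scores are hypothetical and are not taken from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(metric_scores, human_scores):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))


if __name__ == "__main__":
    # Hypothetical toy values for four stories (illustration only, not real data).
    metric_scores = [0.2, 0.5, 0.9, 0.4]
    human_scores = [1, 3, 5, 2]
    print(spearman(metric_scores, human_scores))
```

A rank-based coefficient is a common choice for this kind of comparison because human ratings are usually ordinal, but other statistics (for example Pearson or Kendall) could be substituted in the same way.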
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Digital Storytelling and Education
