Entity Re-identification in Visual Storytelling via Contrastive Reinforcement Learning
Daniel A. P. Oliveira, David Martins de Matos

TL;DR
This paper introduces a contrastive reinforcement learning method to improve entity re-identification and consistency in visual storytelling models, enhancing their ability to maintain character and object references across frames.
Contribution
The paper presents a contrastive reinforcement learning approach that explicitly trains models to connect entities across frames, using synthetic negative examples and a dual-component reward function.
Findings
Grounding mAP improved from 0.27 to 0.31 (+14.8%)
F1 score increased from 0.35 to 0.41 (+17.1%)
Entity persistence in stories with 5+ frames rose from 29.3% to 33.3% (+13.7%)
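The percentages in parentheses are relative gains over the baseline, not absolute differences. As a quick check, the sketch below recomputes them from the rounded before/after values listed above; the function and dictionary names are illustrative and do not come from the paper.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# Rounded before/after values as reported in the findings above.
findings = {
    "grounding_map": (0.27, 0.31),
    "f1": (0.35, 0.41),
    "entity_persistence_pct": (29.3, 33.3),
}

# Relative gain per metric, rounded to one decimal place.
gains = {
    name: round(relative_gain(before, after), 1)
    for name, (before, after) in findings.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain}%")
```

Because the inputs are themselves rounded, small discrepancies against figures computed from unrounded scores would be expected.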
Abstract
Visual storytelling systems, particularly large vision-language models, struggle to maintain character and object identity across frames, often failing to recognize when entities in different images represent the same individuals or objects, leading to inconsistent references and referential hallucinations. This occurs because models lack explicit training on when to establish entity connections across frames. We propose a contrastive reinforcement learning approach that trains models to discriminate between coherent image sequences and stories from unrelated images. We extend the Story Reasoning dataset with synthetic negative examples to teach appropriate entity connection behavior. We employ Direct Preference Optimization with a dual-component reward function that promotes grounding and re-identification of entities in real stories while penalizing incorrect entity connections in…
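To make the training signal concrete, here is a minimal sketch of how a dual-component reward could rank two candidate outputs and feed the standard Direct Preference Optimization loss. It is an illustration under stated assumptions, not the authors' implementation: `dual_reward`, its unweighted counts, and the `beta` default are hypothetical, and the abstract above does not specify the actual reward terms or weights. Only the DPO loss itself follows the standard published form.

```python
import math


def dual_reward(is_real_story: bool, grounded: int, reidentified: int,
                cross_frame_links: int) -> float:
    """Hypothetical dual-component reward (assumed form, not from the paper).

    Real stories: reward grounded and re-identified entities.
    Synthetic negatives (unrelated images): penalize cross-frame entity links.
    """
    if is_real_story:
        return float(grounded + reidentified)
    return -float(cross_frame_links)


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Two candidate outputs for a synthetic negative (unrelated images):
# candidate A links no entities across frames, candidate B links three.
reward_a = dual_reward(False, grounded=0, reidentified=0, cross_frame_links=0)
reward_b = dual_reward(False, grounded=0, reidentified=0, cross_frame_links=3)
chosen_is_a = reward_a > reward_b  # the higher-reward candidate is "chosen"

# Illustrative sequence log-probabilities under the policy and the frozen
# reference model (made-up numbers, chosen candidate listed first).
loss = dpo_loss(logp_chosen=-12.0, logp_rejected=-15.0,
                ref_logp_chosen=-13.0, ref_logp_rejected=-13.5)
```

In this sketch the reward only decides which candidate is preferred; the loss then pushes the policy to raise the chosen output's likelihood relative to the reference model.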
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Generative Adversarial Networks and Image Synthesis
