Entity Re-identification in Visual Storytelling via Contrastive Reinforcement Learning

Daniel A. P. Oliveira; David Martins de Matos

arXiv:2507.07340 · cs.CV · July 14, 2025

TL;DR

This paper introduces a contrastive reinforcement learning method to improve entity re-identification and consistency in visual storytelling models, enhancing their ability to maintain character and object references across frames.

Contribution

It presents a novel contrastive reinforcement learning approach that explicitly trains models to connect entities across frames, using synthetic negative examples and a dual reward function.
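The dual reward idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `StoryStats`, `dual_reward`, `preference_pair`, the equal weights, and the ratio-based scores are all hypothetical, since this summary does not give the actual reward formula.

```python
from dataclasses import dataclass


@dataclass
class StoryStats:
    """Hypothetical per-story counts extracted from a generated story."""
    total_entities: int     # entities mentioned in the story
    grounded_entities: int  # entities tied to an image region
    linked_entities: int    # entities the model claims reappear across frames


def dual_reward(stats: StoryStats, is_real_story: bool,
                w_ground: float = 0.5, w_reid: float = 0.5) -> float:
    """Assumed reward: encourage grounding and re-identification on real
    stories, penalize cross-frame links on synthetic (unrelated) stories."""
    if stats.total_entities == 0:
        return 0.0
    grounding = stats.grounded_entities / stats.total_entities
    reid = stats.linked_entities / stats.total_entities
    if is_real_story:
        return w_ground * grounding + w_reid * reid
    # Unrelated images: any claimed cross-frame link counts against the story.
    return -reid


def preference_pair(a: StoryStats, b: StoryStats, is_real_story: bool):
    """Order two candidate stories as (chosen, rejected) by assumed reward."""
    ra = dual_reward(a, is_real_story)
    rb = dual_reward(b, is_real_story)
    return (a, b) if ra >= rb else (b, a)
```

Under these assumptions, a candidate that links many entities is preferred when the frames form a real story and rejected when they come from unrelated images, which is the contrast the approach relies on.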

Findings

1. Grounding mAP improved from 0.27 to 0.31 (+14.8%)

2. F1 score increased from 0.35 to 0.41 (+17.1%)

3. Entity persistence in stories with 5+ frames rose from 29.3% to 33.3% (+13.7%)
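The percentages in parentheses are relative gains over the baseline, not absolute differences. A short check, assuming the rounded before/after values listed above (the helper name `relative_gain` is illustrative, and the published figures may have been computed from unrounded scores):

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (after - before) / before * 100


# Before/after values as reported in the findings above.
findings = {
    "Grounding mAP": (0.27, 0.31),
    "F1": (0.35, 0.41),
    "Entity persistence, 5+ frames (%)": (29.3, 33.3),
}

for name, (before, after) in findings.items():
    print(f"{name}: {relative_gain(before, after):+.1f}%")
```

For entity persistence, the absolute change is 4.0 percentage points, while the relative change is about 13.7%.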

Abstract

Visual storytelling systems, particularly large vision-language models, struggle to maintain character and object identity across frames. They often fail to recognize when entities in different images represent the same individuals or objects, which leads to inconsistent references and referential hallucinations. This occurs because models lack explicit training on when to establish entity connections across frames. We propose a contrastive reinforcement learning approach that trains models to discriminate between coherent image sequences and stories from unrelated images. We extend the Story Reasoning dataset with synthetic negative examples to teach appropriate entity connection behavior. We employ Direct Preference Optimization with a dual-component reward function that promotes grounding and re-identification of entities in real stories while penalizing incorrect entity connections in…
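For reference, the standard Direct Preference Optimization objective, as it is generally formulated, is given below. It is not taken from this page: the symbols are the conventional ones, with $x$ the input (here, an image sequence), $y_w$ and $y_l$ the preferred and rejected outputs, $\pi_\theta$ the policy being trained, $\pi_{\mathrm{ref}}$ a frozen reference model, $\sigma$ the logistic function, and $\beta$ a scaling hyperparameter. How the paper turns its reward into preference pairs is not stated in the truncated abstract.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Minimizing this loss raises the likelihood of $y_w$ relative to $y_l$, measured against the reference model.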

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Generative Adversarial Networks and Image Synthesis