ReCap: Lightweight Referential Grounding for Coherent Story Visualization
Aditya Arora, Akshita Gupta, Pau Rodriguez, Marcus Rohrbach

TL;DR
ReCap is a lightweight framework that enhances character consistency and visual fidelity in story visualization without significant model modifications, using visual anchors and semantic correction.
Contribution
It introduces CORE and SemDrift modules that improve character stability with minimal additional parameters and no inference overhead.
Findings
ReCap outperforms the previous state of the art on the main benchmarks by 2.63% and 5.65%.
ReCap maintains character identity stability even with vague or referential text.
It extends story visualization to real human-centric narratives.
Abstract
Story visualization aims to generate a sequence of images that faithfully depicts a textual narrative while preserving character identity, spatial configuration, and stylistic coherence as the narrative unfolds. Maintaining such cross-frame consistency has traditionally relied on explicit memory banks, architectural expansion, or auxiliary language models, resulting in substantial parameter growth and inference overhead. We introduce ReCap, a lightweight consistency framework that improves character stability and visual fidelity without modifying the base diffusion backbone. ReCap's CORE (COnditional frame REferencing) module treats anaphors, in our case pronouns, as visual anchors: it activates only when a character is referred to by a pronoun, and then conditions on the preceding frame to propagate visual identity. This selective design avoids unconditional cross-frame conditioning and…
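The selective gating described for CORE can be illustrated with a small sketch. This is not the authors' implementation: the pronoun list, the regex tokenizer, and the names `has_pronoun` and `reference_plan` are all assumptions made for illustration. The sketch only decides, per caption, whether a frame would be conditioned on the preceding frame; the diffusion model itself is left out.

```python
import re
from typing import List, Optional

# Assumed pronoun list for illustration; the abstract does not say how
# pronouns are actually detected.
PRONOUNS = {
    "he", "she", "they", "him", "her", "them",
    "his", "hers", "their", "theirs", "it", "its",
}


def has_pronoun(caption: str) -> bool:
    """Return True if the caption contains any pronoun from the assumed list."""
    tokens = re.findall(r"[a-z']+", caption.lower())
    return any(tok in PRONOUNS for tok in tokens)


def reference_plan(captions: List[str]) -> List[Optional[int]]:
    """For each frame, give the index of the frame to condition on, or None.

    A frame is conditioned on its predecessor only when its caption uses a
    pronoun; the first frame has no predecessor and is never conditioned.
    """
    plan: List[Optional[int]] = []
    for i, caption in enumerate(captions):
        if i > 0 and has_pronoun(caption):
            plan.append(i - 1)  # condition on the preceding frame
        else:
            plan.append(None)   # generate from text alone
    return plan


# Hypothetical four-frame story used only to exercise the gating logic.
story = [
    "Fred walks into the kitchen.",
    "He opens the fridge.",
    "Wilma is reading in the living room.",
    "They talk about dinner.",
]
plan = reference_plan(story)
```

Here the second and fourth captions contain pronouns, so only those frames are marked for conditioning on their predecessors, while the frames that name a character explicitly are generated from text alone.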
