TL;DR
LogiStory introduces a logic-aware framework for multi-image story visualization that explicitly models visual logic, improving narrative coherence and visual quality in generated stories.
Contribution
The paper proposes a novel multi-agent system that explicitly models visual logic, bridging story planning with visual generation for clearer and more coherent visual stories.
Findings
Significant improvement in narrative logic of generated stories.
Enhanced visual quality and coherence in multi-image story generation.
Introduction of the LogicTale benchmark for evaluating visual logic.
Abstract
Generating coherent and communicative visual sequences, such as image sequences and videos, remains a significant challenge for current multimodal systems. Despite advances in visual quality and the integration of world knowledge, existing models still struggle to maintain logical flow, often resulting in disjointed actions, fragmented narratives, and unclear storylines. We attribute these issues to the lack of attention to visual logic, a critical yet underexplored dimension of visual sequence generation that we define as the perceptual and causal coherence among characters, actions, and scenes over time. To bridge this gap, we propose a logic-aware multi-image story visualization framework, LogiStory. The framework is built around the central innovation of explicitly modeling visual logic in story visualization. To realize this idea, we design a multi-agent system that grounds roles,…