Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization
Adyasha Maharana, Mohit Bansal

TL;DR
This paper enhances story visualization by integrating linguistic, commonsense, and visual structures into the generation process, leading to more coherent and relevant image sequences from narrative texts.
Contribution
It introduces a multi-structure encoding framework that combines constituency parse trees, commonsense knowledge, and visual structure (bounding boxes and dense captioning) to improve the quality of generated visual stories.
Findings
Improved visual quality and consistency in generated stories.
Enhanced spatial and narrative relevance through structured inputs.
Significant metric improvements and positive human evaluations.
Abstract
While much research has been done in text-to-image synthesis, little work has been done to explore the usage of linguistic structure of the input text. Such information is even more important for story visualization since its inputs have an explicit narrative structure that needs to be translated into an image sequence (or visual story). Prior work in this domain has shown that there is ample room for improvement in the generated image sequence in terms of visual quality, consistency and relevance. In this paper, we first explore the use of constituency parse trees using a Transformer-based recurrent architecture for encoding structured input. Second, we augment the structured input with commonsense information and study the impact of this external knowledge on the generation of visual story. Third, we also incorporate visual structure via bounding boxes and dense captioning to provide…
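To make the first step of the abstract concrete, here is a minimal sketch of what a constituency parse tree looks like as structured input. It is not the authors' implementation. The example caption, the bracketed parse, and the helper names `parse_bracketed`, `linearize`, and `leaves` are all hypothetical and chosen for illustration only. The sketch reads a Penn-Treebank-style bracketed parse and flattens it into a token sequence that keeps the phrase boundaries, which is one common way of feeding tree structure to a sequence encoder.

```python
from typing import List, Union

# A tree is either a leaf word (str) or a list [label, child1, child2, ...].
Tree = Union[str, list]


def parse_bracketed(s: str) -> Tree:
    """Parse a bracketed constituency string such as "(S (NP (DT the) ...))"."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> Tree:
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok  # a leaf word
        label = tokens[pos]  # the constituent label follows "("
        pos += 1
        node = [label]
        while tokens[pos] != ")":
            node.append(read())
        pos += 1  # consume the closing ")"
        return node

    return read()


def linearize(tree: Tree) -> List[str]:
    """Flatten a tree into tokens, keeping phrase boundaries as markers."""
    if isinstance(tree, str):
        return [tree]
    out = ["(" + tree[0]]
    for child in tree[1:]:
        out.extend(linearize(child))
    out.append(")")
    return out


def leaves(tree: Tree) -> List[str]:
    """Return only the words, i.e. the unstructured caption."""
    if isinstance(tree, str):
        return [tree]
    words: List[str] = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words


# Hypothetical caption and parse, for illustration only.
example = "(S (NP (DT the) (NN cat)) (VP (VBZ sits)))"
tree = parse_bracketed(example)
structured_tokens = linearize(tree)
plain_tokens = leaves(tree)
```

Here `plain_tokens` is the flat caption a conventional text encoder would see, while `structured_tokens` additionally marks where each phrase opens and closes. How the paper's Transformer-based recurrent architecture actually consumes the tree is not specified in the truncated abstract above.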
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Natural Language Processing Techniques
