SEED-Story: Multimodal Long Story Generation with Large Language Model
Shuai Yang, Yuying Ge, Yang Li, Yukang Chen, Yixiao Ge, Ying Shan, Yingcong Chen

TL;DR
SEED-Story introduces a multimodal story generation approach using a large language model that produces coherent, long narratives with interleaved images and texts, advancing the creation of complex multimedia stories.
Contribution
The paper presents SEED-Story, a novel multimodal story generation method leveraging a Multimodal Large Language Model with a new attention mechanism and a large dataset for training and evaluation.
Findings
Generated stories with up to 25 sequences demonstrating coherence
Efficient autoregressive generation with multimodal attention sink
High-quality, high-resolution multimodal stories produced
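To make the "multimodal attention sink" finding more concrete, the following is a minimal sketch of one way such a cache-retention rule could look. It assumes the mechanism keeps a few leading tokens, the tokens adjacent to the begin-of-image (BoI) and end-of-image (EoI) markers, and a sliding window of recent tokens, which is consistent with the reviewers' remark that the model attends strongly to tokens near BoI / EoI. The function name, the token-kind labels, and all parameters are hypothetical and are not taken from the authors' code.

```python
def sink_keep_indices(token_kinds, n_text_sink=4, n_image_edge=2, window=8):
    """Return the sorted cache positions to retain (illustrative only).

    token_kinds: list of labels, each "text", "boi", "img", or "eoi".
    All names and defaults here are assumptions made for this sketch.
    """
    n = len(token_kinds)
    # Leading tokens act as the usual attention sink.
    keep = set(range(min(n_text_sink, n)))
    # The most recent tokens form a sliding window.
    keep.update(range(max(0, n - window), n))
    for i, kind in enumerate(token_kinds):
        if kind == "boi":
            # Keep the BoI marker plus the next n_image_edge positions.
            keep.update(range(i, min(n, i + 1 + n_image_edge)))
        elif kind == "eoi":
            # Keep the EoI marker plus the previous n_image_edge positions.
            keep.update(range(max(0, i - n_image_edge), i + 1))
    return sorted(keep)


# Toy sequence: 3 text tokens, one image (BoI + 6 image tokens + EoI), 10 text tokens.
kinds = ["text"] * 3 + ["boi"] + ["img"] * 6 + ["eoi"] + ["text"] * 10
kept = sink_keep_indices(kinds, n_text_sink=2, n_image_edge=1, window=3)
print(kept)
```

Under these assumptions the retained set stays small as the story grows, because each image contributes only its boundary tokens rather than its full token span.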
Abstract
With the remarkable advancements in image generation and open-form text generation, the creation of interleaved image-text content has become an increasingly intriguing field. Multimodal story generation, characterized by producing narrative texts and vivid images in an interleaved manner, has emerged as a valuable and practical task with broad applications. However, this task poses significant challenges, as it necessitates the comprehension of the complex interplay between texts and images, and the ability to generate long sequences of coherent, contextually relevant texts and visuals. In this work, we propose SEED-Story, a novel method that leverages a Multimodal Large Language Model (MLLM) to generate extended multimodal stories. Our model, built upon the powerful comprehension capability of MLLM, predicts text tokens as well as visual tokens, which are subsequently processed with…
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The samples of generated outputs shown in the manuscript look impressive, especially in long story generation, character consistency, and narrative text quality. This work would be significant if these examples are representative of general performance. 2. The introduction of the StoryStream dataset is a valuable contribution that could advance future research in this field. The authors present the StoryStream dataset, which is already downloadable at https://huggingface.co/dat
1. This paper lacks essential elements such as a clear motivation and problem definition, making it difficult to judge whether the research goal is successfully achieved. These omissions also leave readers unsure whether the evaluation metrics and comparative methods are appropriate or sufficient. 2. To establish its academic value, the manuscript should provide theoretical or experimental evidence. The manuscript shows the quality of generated images with FID and CLIP score as a
- The proposed dataset has high-quality, high-resolution images with longer story length, which would be useful for the research community. - The proposed model seems to make sense, and the presented analysis of the multimodal attention sink mechanism is nice. - The experimental results show that the proposed approach outperforms the baseline on the proposed dataset and demonstrate its effectiveness, even though it still has a limitation (please see the weaknesses below).
- The novelty of the proposed approach seems somewhat incremental, as it appears to share some ideas with MM-interleaved (Tian et al. (2024)). Also, the proposed multimodal attention sink mechanism seems to be just a slight modification of the existing attention sink mechanism, even though it includes a nice analysis. - The proposed approach was only evaluated on the proposed StoryStream dataset. However, as MM-interleaved (Tian et al. (2024)) was also evaluated on both Pororo and Flintstones datasets in their paper
The strongest contribution of the paper is the new dataset (StoryStream). There seems to be a need for this new dataset, given that the existing datasets look quite low-resolution and simple. It will be useful for other papers to use this dataset to train and evaluate. The major technical contribution of the paper is the multimodal attention sink. This is quite an interesting non-obvious observation that the model attributes high attention to tokens near BoI / EoI. Then the authors take this ob
The long generated stories don't seem to be very good in terms of storytelling. For example, Figure F does not really suggest any logical story (e.g. with a beginning, middle, end). It seems more like random captions were generated, especially near the end. The LLM also starts repeating itself often (e.g. "A man in a hat and shirt looked surprised", "George, the cartoon monkey, stood in a grassy area"). The authors in the paper say that this plot is "engaging", but IMO it is not engaging. This
- The tokenizer and de-tokenizer training coupled with multimodal instruction finetuning is neat. - The adaptation stage seems useful. - The extended multimodal attention sink mechanism is interesting and should be studied more.
- While the proposed method is evaluated well on the featured StoryStream dataset, the method should still be evaluated on existing ones such as Pororo and Flintstones. - While I appreciate the proposed dataset, the visual consistency test is still lacking and a well-defined metric particularly for visual consistency is much needed. - The attention sink mechanism is not well-studied yet in this manuscript, for example, how does the performance with its introduction scale with lengths, story com
+ The paper curated a high-resolution long story dataset StoryStream that contains 257k images with story length up to 30 frames. + It proposes a multimodal attention sink mechanism for efficient long-sequence generation. + The qualitative results are impressive for generating long multimodal stories.
+ The model architecture is quite similar to SEED, with the proposed multimodal attention sink. + The evaluation is not very comprehensive, given that existing literature also evaluates image consistency and character accuracy in the story generation setting.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Softmax · Attention Is All You Need
