Improving Visual Storytelling with Multimodal Large Language Models
Xiaochuan Lin, Xiangyong Chen

TL;DR
This paper introduces a multimodal large language model approach to visual storytelling that combines vision and language models with instruction tuning to generate coherent, emotionally resonant stories, and reports outperforming existing methods.
Contribution
It presents a new dataset and a fine-tuning method that significantly improves narrative coherence, relevance, and emotional depth in visual storytelling.
Findings
Outperforms existing models in coherence and relevance
Achieves higher scores in emotional depth and quality
Demonstrates effectiveness of instruction tuning
Abstract
Visual storytelling is an emerging field that combines images and narratives to create engaging and contextually rich stories. Despite its potential, generating coherent and emotionally resonant visual stories remains challenging due to the complexity of aligning visual and textual information. This paper presents a novel approach leveraging large language models (LLMs) and large vision-language models (LVLMs) combined with instruction tuning to address these challenges. We introduce a new dataset comprising diverse visual stories, annotated with detailed captions and multimodal elements. Our method employs a combination of supervised and reinforcement learning to fine-tune the model, enhancing its narrative generation capabilities. Quantitative evaluations using GPT-4 and qualitative human assessments demonstrate that our approach significantly outperforms existing models, achieving…
Taxonomy
Topics: Digital Storytelling and Education · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · Byte Pair Encoding · Layer Normalization · Label Smoothing · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam
