Improving Visual Storytelling with Multimodal Large Language Models

Xiaochuan Lin; Xiangyong Chen

arXiv:2407.02586·cs.CV·July 4, 2024

Open Access

TL;DR

This paper introduces a novel multimodal large language model approach for visual storytelling, combining vision and language models with instruction tuning to generate coherent, emotionally resonant stories, outperforming existing methods.

Contribution

It presents a new dataset and a fine-tuning method that significantly improves narrative coherence, relevance, and emotional depth in visual storytelling.

Findings

01. Outperforms existing models in coherence and relevance

02. Achieves higher scores in emotional depth and quality

03. Demonstrates effectiveness of instruction tuning

Abstract

Visual storytelling is an emerging field that combines images and narratives to create engaging and contextually rich stories. Despite its potential, generating coherent and emotionally resonant visual stories remains challenging due to the complexity of aligning visual and textual information. This paper presents a novel approach leveraging large language models (LLMs) and large vision-language models (LVLMs) combined with instruction tuning to address these challenges. We introduce a new dataset comprising diverse visual stories, annotated with detailed captions and multimodal elements. Our method employs a combination of supervised and reinforcement learning to fine-tune the model, enhancing its narrative generation capabilities. Quantitative evaluations using GPT-4 and qualitative human assessments demonstrate that our approach significantly outperforms existing models, achieving…

Taxonomy

Topics: Digital Storytelling and Education · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · Byte Pair Encoding · Layer Normalization · Label Smoothing · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam