Story Visualization by Online Text Augmentation with Context Memory

Daechul Ahn; Daneul Kim; Gwangmo Song; Seung Hwan Kim; Honglak Lee; Dongyeop Kang; Jonghyun Choi

arXiv:2308.07575 · cs.CV · August 22, 2023

Open Access · 1 Repo

TL;DR

This paper introduces a novel memory-augmented transformer framework with online text augmentation for story visualization, effectively capturing long-term context and improving image generation quality across multiple sentences.

Contribution

It proposes a new memory architecture combined with online text augmentation within a transformer framework to enhance long-term context encoding in story visualization.

Findings

01. Significantly outperforms state-of-the-art methods on the Pororo-SV and Flintstones-SV benchmarks.

02. Improves metrics including FID, character F1, and BLEU.

03. Achieves better generalization with similar or lower computational complexity.

Abstract

Story visualization (SV) is a challenging text-to-image generation task for the difficulty of not only rendering visual details from the text descriptions but also encoding a long-term context across multiple sentences. While prior efforts mostly focus on generating a semantically relevant image for each sentence, encoding a context spread across the given paragraph to generate contextually convincing images (e.g., with a correct character or with a proper background of the scene) remains a challenge. To this end, we propose a novel memory architecture for the Bi-directional Transformer framework with an online text augmentation that generates multiple pseudo-descriptions as supplementary supervision during training for better generalization to the language variation at inference. In extensive experiments on the two popular SV benchmarks, i.e., the Pororo-SV and Flintstones-SV, the…
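The idea of using pseudo-descriptions as supplementary supervision can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how pseudo-descriptions are produced or how their loss is weighted, so `toy_pseudo_description` (random word dropout), `augmented_loss`, `num_pseudo`, and `pseudo_weight` are all hypothetical names and choices made only for illustration.

```python
import random
from typing import Callable


def toy_pseudo_description(caption: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a pseudo-description generator.

    Drops one random word from the caption. The real method's generator
    is not described in the abstract; this is only a placeholder.
    """
    words = caption.split()
    if len(words) <= 1:
        return caption
    drop = rng.randrange(len(words))
    return " ".join(w for i, w in enumerate(words) if i != drop)


def augmented_loss(
    caption: str,
    loss_fn: Callable[[str], float],
    make_pseudo: Callable[[str, random.Random], str] = toy_pseudo_description,
    num_pseudo: int = 2,
    pseudo_weight: float = 0.5,
    seed: int = 0,
) -> float:
    """Loss on the original caption plus a weighted mean loss over
    pseudo-descriptions generated on the fly (assumed weighting scheme)."""
    rng = random.Random(seed)
    pseudo = [make_pseudo(caption, rng) for _ in range(num_pseudo)]
    base = loss_fn(caption)
    if not pseudo:
        return base
    extra = sum(loss_fn(p) for p in pseudo) / len(pseudo)
    return base + pseudo_weight * extra
```

In a real training loop, `loss_fn` would be the model's text-conditioned image generation loss; here it can be any callable that maps a caption to a number, which keeps the sketch runnable without a model.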

Code & Models

Repositories

yonseivnl/cmota (PyTorch · Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Label Smoothing · Layer Normalization · Absolute Position Encodings · Residual Connection