Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling

Zilyu Ye; Jinxiu Liu; Ruotian Peng; Jinjin Cao; Zhiyang Chen; Yiyang Zhang; Ziwei Xuan; Mingyuan Zhou; Xiaoqian Shen; Mohamed Elhoseiny; Qi Liu; Guo-Jun Qi

arXiv:2408.03695·cs.CV·August 8, 2024

TL;DR

Openstory++ introduces a large-scale, instance-aware dataset and benchmark for open-domain visual storytelling, enabling models to generate consistent, high-quality narratives across complex, multi-instance visual data.

Contribution

It provides a novel dataset with instance-level annotations and a new benchmark framework for evaluating long-context multimodal generation tasks.

Findings

01

Openstory++ outperforms previous datasets in visual storytelling quality.

02

Models trained on Openstory++ show improved consistency in multi-instance scenarios.

03

Cohere-Bench effectively evaluates models on long-context multimodal tasks.

Abstract

Recent image generation models excel at creating high-quality images from brief captions. However, they fail to maintain consistency of multiple instances across images when encountering lengthy contexts. This inconsistency is largely due to the absence of granular instance feature labeling in existing training datasets. To tackle these issues, we introduce Openstory++, a large-scale dataset combining additional instance-level annotations with both images and text. Furthermore, we develop a training methodology that emphasizes entity-centric image-text generation, ensuring that the models learn to effectively interweave visual and textual information. Specifically, Openstory++ streamlines the process of keyframe extraction from open-domain videos, employing vision-language models to generate captions that are then polished by a large language model for…
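To make the keyframe-extraction step mentioned in the abstract more concrete, here is a minimal sketch of one common generic approach: keep a frame only when it differs enough from the last kept keyframe. This is not the authors' pipeline, which the abstract does not detail. The frame representation (flattened grayscale values in [0, 1]), the names `mean_abs_diff` and `select_keyframes`, and the `threshold` parameter are all assumptions made for illustration.

```python
from typing import List, Sequence

# Hypothetical frame representation: flattened grayscale pixels in [0, 1].
Frame = Sequence[float]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two same-sized frames."""
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and have the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_keyframes(frames: List[Frame], threshold: float = 0.2) -> List[int]:
    """Return indices of frames that differ enough from the last kept keyframe.

    Assumption: a simple difference threshold stands in for whatever
    selection criterion the actual dataset pipeline uses.
    """
    if not frames:
        return []
    keep = [0]  # always keep the first frame as a keyframe
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[keep[-1]]) > threshold:
            keep.append(i)
    return keep


# Toy "video": five 4-pixel frames with two clear scene changes.
toy_frames = [[0.0] * 4, [0.05] * 4, [0.5] * 4, [0.55] * 4, [1.0] * 4]
keyframe_indices = select_keyframes(toy_frames, threshold=0.2)
print(keyframe_indices)  # [0, 2, 4]
```

In a pipeline like the one the abstract outlines, the selected frames would then be passed to a captioning model; that stage is omitted here because it depends on model choices the abstract does not specify.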

Code & Models

Repositories

YeLuoSuiYou/openstorypp
PyTorch · Official

Datasets

MAPLE-WestLake-AIGC/OpenstoryPlusPlus
dataset · 362 dl

Taxonomy

Topics

Digital Storytelling and Education · Video Analysis and Summarization · Artificial Intelligence in Games