OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
Pengfei Zhou, Xiaopeng Peng, Jiajun Song, Chuanhao Li, Zhaopan Xu, Yue, Yang, Ziyao Guo, Hao Zhang, Yuqi Lin, Yefei He, Lirui Zhao, Shuo Liu, Tianhua, Li, Yuxuan Xie, Xiaojun Chang, Yu Qiao, Wenqi Shao, Kaipeng Zhang

TL;DR
OpenING is a new comprehensive benchmark with 5,400 instances across 56 tasks designed to evaluate interleaved image-text generation, along with IntJudge, a model for assessing these outputs, revealing significant room for improvement in current methods.
Contribution
The paper introduces OpenING, a large-scale, diverse benchmark for interleaved image-text generation, and presents IntJudge, an effective evaluation model surpassing GPT-based evaluators.
Findings
Current methods show substantial room for improvement.
IntJudge achieves 82.42% agreement with human judgments.
OpenING covers diverse real-world scenarios for robust evaluation.
Abstract
Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding and generation tasks. However, generating interleaved image-text content remains a challenge, which requires integrated multimodal understanding and generation abilities. While the progress in unified models offers new solutions, existing benchmarks are insufficient for evaluating these methods due to limitations in data size and diversity. To bridge this gap, we introduce OpenING, a comprehensive benchmark comprising 5,400 high-quality human-annotated instances across 56 real-world tasks. OpenING covers diverse daily scenarios such as travel guide, design, and brainstorming, offering a robust platform for challenging interleaved generation methods. In addition, we present IntJudge, a judge model for evaluating open-ended multimodal generation methods. Trained with a novel data pipeline, our…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Video Analysis and Summarization · Digital Storytelling and Education
MethodsEmirates Airlines Office in Dubai
