OmniGenBench: A Benchmark for Omnipotent Multimodal Generation across 50+ Tasks

Jiayu Wang; Yang Jiao; Yue Yu; Tianwen Qian; Shaoxiang Chen; Jingjing Chen; Yu-Gang Jiang

arXiv:2505.18775 · cs.CV · May 27, 2025

TL;DR

OmniGenBench is a benchmark of 57 sub-tasks for evaluating how well large multimodal models follow image-generation instructions across perception-centric and cognition-centric dimensions, and it provides detailed comparisons of state-of-the-art models.

Contribution

The paper introduces OmniGenBench, a new benchmark that systematically assesses multimodal models across a wide range of real-world tasks using a dual-mode evaluation protocol.

Findings

01 GPT-4o outperforms other models on perception tasks.

02 Models show varied strengths across cognition-centric tasks.

03 Benchmark enables detailed performance analysis of multimodal models.

Abstract

Recent breakthroughs in large multimodal models (LMMs), such as the impressive GPT-4o-Native, have demonstrated remarkable proficiency in following general-purpose instructions for image generation. However, current benchmarks often lack the necessary breadth and depth to fully evaluate the diverse capabilities of these models. To overcome this limitation, we introduce OmniGenBench, a novel and comprehensive benchmark meticulously designed to assess the instruction-following abilities of state-of-the-art LMMs across both perception-centric and cognition-centric dimensions. Our OmniGenBench includes 57 diverse sub-tasks grounded in real-world scenarios, systematically categorized according to the specific model capabilities they demand. For rigorous evaluation, we further employ a dual-mode protocol. This protocol utilizes off-the-shelf visual parsing tools for perception-centric tasks…
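
The dual-mode protocol mentioned above can be pictured as a dispatcher that routes each generated sample to a scorer chosen by its task category. The following is a minimal sketch under stated assumptions, not the authors' implementation: every name in it (Sample, evaluate, the category strings) is hypothetical, and because the abstract is truncated before the second mode is described, both scorers are left as caller-supplied functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable

# All names below are illustrative assumptions; none come from the paper or its repo.
PERCEPTION = "perception"
COGNITION = "cognition"


@dataclass(frozen=True)
class Sample:
    task: str         # sub-task name (hypothetical)
    category: str     # PERCEPTION or COGNITION
    instruction: str  # instruction given to the generation model
    image_path: str   # where the generated image was saved


# A scorer maps one sample to a score in [0, 1].
Scorer = Callable[[Sample], float]


def evaluate(samples: Iterable[Sample], scorers: Dict[str, Scorer]) -> Dict[str, float]:
    """Route each sample to the scorer for its category; return the mean score per category."""
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for sample in samples:
        if sample.category not in scorers:
            raise KeyError(f"no scorer registered for category {sample.category!r}")
        score = scorers[sample.category](sample)
        totals[sample.category] = totals.get(sample.category, 0.0) + score
        counts[sample.category] = counts.get(sample.category, 0) + 1
    return {category: totals[category] / counts[category] for category in totals}
```

In this sketch, the perception-centric scorer would wrap whatever off-the-shelf visual parsing tool is chosen, while the cognition-centric scorer stays a placeholder until its mode is specified. Keeping both behind the same Scorer signature lets either be swapped without changing the aggregation.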

Code & Models

Repositories

emilia113/omnigenbench
paddle · Official

Taxonomy

Topics: Social Robot Interaction and HRI · Speech and dialogue systems · AI in Service Interactions

Methods: Visual Parsing