OmniGenBench: A Benchmark for Omnipotent Multimodal Generation across 50+ Tasks
Jiayu Wang, Yang Jiao, Yue Yu, Tianwen Qian, Shaoxiang Chen, Jingjing Chen, Yu-Gang Jiang

TL;DR
OmniGenBench is a benchmark of 57 diverse sub-tasks that evaluates how well large multimodal models follow image-generation instructions across perception-centric and cognition-centric dimensions, and it provides detailed comparisons of state-of-the-art models.
Contribution
The paper introduces OmniGenBench, a new benchmark that systematically assesses multimodal models across a wide range of real-world tasks using a dual-mode evaluation protocol.
Findings
GPT-4o outperforms other models on perception tasks.
Models show varied strengths across cognition-centric tasks.
The benchmark enables detailed performance analysis of multimodal models.
Abstract
Recent breakthroughs in large multimodal models (LMMs), such as the impressive GPT-4o-Native, have demonstrated remarkable proficiency in following general-purpose instructions for image generation. However, current benchmarks often lack the necessary breadth and depth to fully evaluate the diverse capabilities of these models. To overcome this limitation, we introduce OmniGenBench, a novel and comprehensive benchmark meticulously designed to assess the instruction-following abilities of state-of-the-art LMMs across both perception-centric and cognition-centric dimensions. Our OmniGenBench includes 57 diverse sub-tasks grounded in real-world scenarios, systematically categorized according to the specific model capabilities they demand. For rigorous evaluation, we further employ a dual-mode protocol. This protocol utilizes off-the-shelf visual parsing tools for perception-centric tasks…
Taxonomy
Topics: Social Robot Interaction and HRI · Speech and dialogue systems · AI in Service Interactions
Methods: Visual Parsing
