JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images
Zhecan Wang, Junzhang Liu, Chia-Wei Tang, Hani Alomari, Anushka Sivakumar, Rui Sun, Wenhao Li, Md. Atabuzzaman, Hammad Ayyubi, Haoxuan You, Alvi Ishmam, Kai-Wei Chang, Shih-Fu Chang, Chris Thomas

TL;DR
JourneyBench is a new, challenging benchmark for vision-language understanding that tests models' fine-grained reasoning on generated images in unusual scenarios, revealing limitations in current multimodal models.
Contribution
The paper introduces JourneyBench, a comprehensive human-annotated benchmark specifically designed to evaluate fine-grained multimodal reasoning in generated images, addressing limitations of existing benchmarks.
Findings
State-of-the-art models perform poorly on JourneyBench.
JourneyBench reveals gaps in models' visual reasoning abilities.
The benchmark is highly challenging, exposing weaknesses in current models.
Abstract
Existing vision-language understanding benchmarks largely consist of images of objects in their usual contexts. As a consequence, recent multimodal large language models can perform well with only a shallow visual understanding by relying on background language biases. Thus, strong performance on these benchmarks does not necessarily correlate with strong visual understanding. In this paper, we release JourneyBench, a comprehensive human-annotated benchmark of generated images designed to assess the model's fine-grained multimodal reasoning abilities across five tasks: complementary multimodal chain of thought, multi-image VQA, imaginary image captioning, VQA with hallucination triggers, and fine-grained retrieval with sample-specific distractors. Unlike existing benchmarks, JourneyBench explicitly requires fine-grained multimodal reasoning in unusual imaginary scenarios where language…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
