JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images

Zhecan Wang; Junzhang Liu; Chia-Wei Tang; Hani Alomari; Anushka Sivakumar; Rui Sun; Wenhao Li; Md. Atabuzzaman; Hammad Ayyubi; Haoxuan You; Alvi Ishmam; Kai-Wei Chang; Shih-Fu Chang; Chris Thomas

arXiv:2409.12953 · cs.CV · January 13, 2025

Open Access · 1 Repo

TL;DR

JourneyBench is a new, challenging benchmark for vision-language understanding that tests models' fine-grained reasoning on generated images in unusual scenarios, revealing limitations in current multimodal models.

Contribution

The paper introduces JourneyBench, a comprehensive human-annotated benchmark specifically designed to evaluate fine-grained multimodal reasoning in generated images, addressing limitations of existing benchmarks.

Findings

01 State-of-the-art models perform poorly on JourneyBench.

02 JourneyBench reveals gaps in models' visual reasoning abilities.

03 The benchmark is highly challenging and exposes weaknesses in current models.

Abstract

Existing vision-language understanding benchmarks largely consist of images of objects in their usual contexts. As a consequence, recent multimodal large language models can perform well with only a shallow visual understanding by relying on background language biases. Thus, strong performance on these benchmarks does not necessarily correlate with strong visual understanding. In this paper, we release JourneyBench, a comprehensive human-annotated benchmark of generated images designed to assess the model's fine-grained multimodal reasoning abilities across five tasks: complementary multimodal chain of thought, multi-image VQA, imaginary image captioning, VQA with hallucination triggers, and fine-grained retrieval with sample-specific distractors. Unlike existing benchmarks, JourneyBench explicitly requires fine-grained multimodal reasoning in unusual imaginary scenarios where language…

Code & Models

Repositories

journeybench/journeybench
none · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques