Visual Reasoning Evaluation of Grok, Deepseek Janus, Gemini, Qwen, Mistral, and ChatGPT

Nidhal Jegham; Marwan Abdelatti; Abdeltawab Hendawi

arXiv:2502.16428·cs.CV·February 25, 2025

TL;DR

This paper introduces a comprehensive benchmark for evaluating multimodal large language models on multi-image reasoning, stability, and uncertainty, revealing insights into model performance, biases, and the impact of size and architecture.

Contribution

It presents a novel benchmark incorporating multi-image reasoning, rejection-based evaluation, and entropy metrics, advancing the assessment of multimodal LLMs beyond traditional single-image tests.

Findings

01. ChatGPT-o1 achieves highest overall accuracy (82.5%)

02. QVQ-72B-Preview shows superior rejection accuracy (85.5%)

03. Janus models exhibit high entropy and bias, indicating unstable reasoning

Abstract

Traditional evaluations of multimodal large language models (LLMs) have been limited by their focus on single-image reasoning, failing to assess crucial aspects like contextual understanding, reasoning stability, and uncertainty calibration. This study addresses these limitations by introducing a novel benchmark that integrates multi-image reasoning tasks with rejection-based evaluation and positional bias detection. To evaluate these dimensions, we further introduce entropy as a novel metric for quantifying reasoning consistency across reordered answer variants. We applied this benchmark to assess Grok 3, ChatGPT-4o, ChatGPT-o1, Gemini 2.0 Flash Experimental, DeepSeek Janus models, Qwen2.5-VL-72B-Instruct, QVQ-72B-Preview, and Pixtral 12B across eight visual reasoning tasks, including difference spotting and diagram interpretation. Our findings reveal ChatGPT-o1 leading in overall…
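The abstract describes entropy as a measure of reasoning consistency across reordered answer variants but does not give the formula here. As a minimal sketch, assuming the metric is the Shannon entropy of the answers a model selects when the same question is posed with its options shuffled, it could be computed as follows. The function name `answer_entropy` and the sample data are hypothetical, and the paper's exact definition may differ.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in bits) of the answers chosen across reordered variants.

    Assumption: each element is the *content* of the chosen option (not its
    letter position), so the same choice is counted together even when the
    options were shuffled.
    """
    if not answers:
        raise ValueError("answers must be non-empty")
    total = len(answers)
    counts = Counter(answers)
    # Sum of p * log2(1/p) over the observed answers.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical responses to one question asked with four option orderings.
stable = ["two cars", "two cars", "two cars", "two cars"]
unstable = ["two cars", "one car", "three cars", "no cars"]

print(answer_entropy(stable))
print(answer_entropy(unstable))
```

Under this assumption, an entropy of 0 means the model picked the same answer in every ordering, while higher values mean its choice shifts with option position, which is consistent with the positional bias the benchmark is designed to detect.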

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education

Methods: Focus