RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation

Tianyi Niu; Jaemin Cho; Elias Stengel-Eskin; Mohit Bansal

arXiv:2508.13968 · cs.CV · January 27, 2026

TL;DR

This paper evaluates the ability of Multimodal Large Language Models to identify image rotations, revealing significant gaps in their spatial reasoning compared to human perception, and introduces RotBench, a new benchmark for this task.

Contribution

The paper introduces RotBench, a benchmark to evaluate MLLMs on image rotation identification, and provides a comprehensive analysis of their limitations in spatial reasoning.

Findings

01. Most models reliably identify upright images.

02. Models struggle to distinguish 90° and 270° rotations.

03. Fine-tuning improves 180° detection but not 90°/270° differentiation.
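One way to make the 90°/270° finding measurable is to tally (true angle, predicted angle) pairs and report how often the two angles are swapped. The sketch below is illustrative only: the function names and the toy pairs are assumptions made for this example, not data or code from the paper.

```python
from collections import Counter
from typing import Iterable, Tuple

Pair = Tuple[int, int]  # (true angle, predicted angle), both in degrees


def confusion_counts(pairs: Iterable[Pair]) -> Counter:
    """Count how often each (true, predicted) combination occurs."""
    return Counter(pairs)


def swap_rate_90_270(pairs: Iterable[Pair]) -> float:
    """Fraction of 90/270 images whose prediction is the opposite angle."""
    counts = confusion_counts(pairs)
    # Denominator: every image whose true rotation is 90 or 270.
    total = sum(n for (true, _), n in counts.items() if true in (90, 270))
    # Numerator: 90 predicted as 270, plus 270 predicted as 90.
    swapped = counts[(90, 270)] + counts[(270, 90)]
    return swapped / total if total else 0.0


# Hypothetical predictions, invented purely to exercise the functions.
toy_pairs = [(0, 0), (180, 180), (90, 90), (90, 270), (270, 90), (270, 270)]
toy_swap_rate = swap_rate_90_270(toy_pairs)
```

Reporting the swap rate separately from overall accuracy keeps a model that handles 0° and 180° well from masking a failure on the two sideways orientations.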

Abstract

We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench, a 350-image manually filtered benchmark comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information (including captions, depth maps, and more) or using chain-of-thought prompting offers only small and inconsistent improvements. Our…
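The task setup in the abstract can be sketched as follows: each image is presented at all four rotations, and a predictor is scored per angle. This is a minimal sketch under stated assumptions, not the authors' evaluation code. Images are stood in for by small integer grids, rotation is taken to be counter-clockwise (the abstract does not state a direction), and `predict` is a hypothetical callable that returns one of the four angles.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]  # toy stand-in for an image: rows of pixel values
ANGLES = (0, 90, 180, 270)


def rotate_ccw_90(img: Grid) -> Grid:
    """Rotate a grid 90 degrees counter-clockwise (assumed convention)."""
    # Transpose, then reverse the row order.
    return [list(row) for row in zip(*img)][::-1]


def rotate(img: Grid, angle: int) -> Grid:
    """Rotate a grid by 0, 90, 180, or 270 degrees counter-clockwise."""
    if angle not in ANGLES:
        raise ValueError(f"unsupported angle: {angle}")
    out = [row[:] for row in img]  # copy so the input is left untouched
    for _ in range(angle // 90):
        out = rotate_ccw_90(out)
    return out


def accuracy_by_angle(
    images: List[Grid], predict: Callable[[Grid], int]
) -> Dict[int, float]:
    """Per-angle accuracy of `predict` over all four rotations of each image."""
    if not images:
        raise ValueError("need at least one image")
    correct = {angle: 0 for angle in ANGLES}
    for img in images:
        for angle in ANGLES:
            if predict(rotate(img, angle)) == angle:
                correct[angle] += 1
    return {angle: correct[angle] / len(images) for angle in ANGLES}
```

A trivial baseline such as `lambda img: 0` scores 1.0 at 0° and 0.0 at every other angle, which is why per-angle accuracy is more informative here than a single pooled number.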

Code & Models

Datasets

tianyin/RotBench
dataset· 71 dl

Videos

RotBench: Evaluating Multi-modal Large Language Models on Identifying Image Rotation

Taxonomy

Topics: Speech and dialogue systems