Why MLLMs Struggle to Determine Object Orientations
Anju Gopinath, Nikhil Krishnaswamy, Bruce Draper

TL;DR
This study tests whether the visual encoders in multimodal large language models (MLLMs) retain object orientation information. It finds that the information is present but diffusely distributed across features, which challenges the earlier hypothesis that orientation failures originate in the encoder.
Contribution
The paper empirically demonstrates that visual encoders in MLLMs preserve orientation information, contradicting prior hypotheses that failures stem from encoder deficiencies.
Findings
Orientation information is recoverable from the visual encoder embeddings.
Linear models accurately predict object orientations from features.
Orientation information is diffusely spread across many features.
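The idea of a linear probe behind these findings can be illustrated with a minimal sketch. It is not the authors' protocol: the features below are synthetic stand-ins for encoder embeddings, and the (sin, cos) regression targets, the ridge penalty, the noise level, and all function names are assumptions chosen for illustration. The sketch fits a linear map from features to the sine and cosine of the rotation angle, then compares a probe over all features with one restricted to the eight most heavily weighted features, which is one simple way to see what "diffusely spread" means.

```python
import numpy as np

def synth_features(angles, mixing, noise, rng):
    """Synthetic stand-in for encoder embeddings (assumption, not real data):
    every dimension carries a weak linear mix of sin/cos of the angle plus noise."""
    targets = np.stack([np.sin(angles), np.cos(angles)], axis=1)
    return targets @ mixing + noise * rng.normal(size=(len(angles), mixing.shape[1]))

def fit_probe(feats, angles, ridge=1e-3):
    """Ridge-regularised least squares from features (plus bias) to (sin, cos)."""
    X = np.hstack([feats, np.ones((len(feats), 1))])
    Y = np.stack([np.sin(angles), np.cos(angles)], axis=1)
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Y)

def predict(W, feats):
    """Decode an angle in [0, 2*pi) from the predicted (sin, cos) pair."""
    X = np.hstack([feats, np.ones((len(feats), 1))])
    s, c = (X @ W).T
    return np.arctan2(s, c) % (2 * np.pi)

def angular_error_deg(pred, true):
    """Mean absolute angular difference in degrees, respecting wrap-around."""
    d = np.abs(pred - true) % (2 * np.pi)
    return float(np.degrees(np.minimum(d, 2 * np.pi - d).mean()))

rng = np.random.default_rng(0)
dim = 256  # hypothetical embedding width
mixing = rng.normal(size=(2, dim)) / np.sqrt(dim)  # spreads the signal over all dims

train_a = rng.uniform(0, 2 * np.pi, 2000)
test_a = rng.uniform(0, 2 * np.pi, 500)
train_x = synth_features(train_a, mixing, 0.1, rng)
test_x = synth_features(test_a, mixing, 0.1, rng)

# Probe using every feature.
W = fit_probe(train_x, train_a)
err_all = angular_error_deg(predict(W, test_x), test_a)

# Probe restricted to the 8 features with the largest probe weights.
top = np.argsort(np.linalg.norm(W[:-1], axis=1))[-8:]
W_top = fit_probe(train_x[:, top], train_a)
err_top = angular_error_deg(predict(W_top, test_x[:, top]), test_a)

print(f"all features: {err_all:.1f} deg, top-8 features: {err_top:.1f} deg")
```

Regressing on (sin, cos) rather than the raw angle avoids the discontinuity at 0/360 degrees. With real embeddings, `synth_features` would be replaced by features extracted from the encoder under study.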
Abstract
Multimodal Large Language Models (MLLMs) struggle with tasks that require reasoning about 2D object orientation in images, as documented in prior work. Tong et al. and Nichols et al. hypothesize that these failures originate in the visual encoder, since commonly used encoders such as CLIP and SigLIP are trained for image-text semantic alignment rather than geometric reasoning. We design a controlled empirical protocol to test this claim by measuring whether rotations can be recovered from encoder representations. In particular, we examine SigLIP and ViT features from LLaVA OneVision and Qwen2.5-VL-7B-Instruct models, respectively, using full images, and examine CLIP representations in LLaVA 1.5 and 1.6 using rotated foreground patches against natural background images. Our null hypothesis is that orientation information is not preserved in the encoder embeddings and we test this by…
