The Perceptual Observatory: Characterizing Robustness and Grounding in MLLMs
Tejas Anvekar, Fenil Bardoliya, Pavan K. Turaga, Chitta Baral, Vivek Gupta

TL;DR
This paper introduces The Perceptual Observatory, a framework for evaluating the robustness and grounding of multimodal large language models across various visual tasks and perturbations, revealing their perceptual strengths and weaknesses.
Contribution
It provides a systematic evaluation framework that moves beyond accuracy to analyze perceptual grounding and robustness of MLLMs under controlled perturbations.
Findings
MLLMs maintain some grounding under perturbations but show weaknesses in complex tasks.
The framework reveals differences in robustness across models and tasks.
Perturbation-based evaluation offers new insights into model perceptual capabilities.
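To make the idea of perturbation-based evaluation concrete, here is a minimal sketch of comparing a model's answers on clean and perturbed inputs. It is not the paper's implementation: the toy image type, the `toy_model` stand-in for an MLLM, the `dim` and `vflip` perturbations, and the `evaluate` helper are all hypothetical names chosen for illustration, and the perturbations are assumed to preserve the ground-truth label.

```python
from typing import Callable, Dict, List, Tuple

Image = List[List[float]]            # toy grayscale image, values in [0, 1]
Model = Callable[[Image], str]       # hypothetical stand-in for querying an MLLM
Perturbation = Callable[[Image], Image]


def dim(img: Image) -> Image:
    """Pixel-level perturbation (assumed): halve every intensity."""
    return [[0.5 * p for p in row] for row in img]


def vflip(img: Image) -> Image:
    """Geometric perturbation (assumed): reverse the row order."""
    return [list(row) for row in reversed(img)]


def toy_model(img: Image) -> str:
    """Deliberately brittle toy model: answers 'left' only if the
    left half's mean intensity exceeds a fixed threshold of 0.5."""
    half = len(img[0]) // 2
    left = [p for row in img for p in row[:half]]
    return "left" if sum(left) / len(left) > 0.5 else "right"


def evaluate(
    model: Model,
    samples: List[Tuple[Image, str]],
    perturbations: Dict[str, Perturbation],
) -> Dict[str, Dict[str, float]]:
    """Report accuracy, plus consistency with the clean prediction,
    for the clean inputs and for each perturbation."""
    n = len(samples)
    labels = [label for _, label in samples]
    clean_preds = [model(img) for img, _ in samples]
    report = {
        "clean": {
            "accuracy": sum(p == y for p, y in zip(clean_preds, labels)) / n,
            "consistency": 1.0,
        }
    }
    for name, perturb in perturbations.items():
        preds = [model(perturb(img)) for img, _ in samples]
        report[name] = {
            "accuracy": sum(p == y for p, y in zip(preds, labels)) / n,
            "consistency": sum(p == c for p, c in zip(preds, clean_preds)) / n,
        }
    return report


# Task (assumed): "which side of the image is brighter?"
samples = [
    ([[0.9, 0.1], [0.9, 0.1]], "left"),
    ([[0.1, 0.9], [0.1, 0.9]], "right"),
]
report = evaluate(toy_model, samples, {"dim": dim, "vflip": vflip})
for condition, metrics in report.items():
    print(condition, metrics)
```

Reporting consistency alongside accuracy separates two failure modes: a model can stay accurate on average while changing its answers on individual inputs, which end-task accuracy alone would hide.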
Abstract
Recent advances in multimodal large language models (MLLMs) have yielded increasingly powerful models, yet their perceptual capacities remain poorly characterized. In practice, most model families scale the language component while reusing nearly identical vision encoders (e.g., Qwen2.5-VL 3B/7B/72B), which raises pivotal concerns about whether progress reflects genuine visual grounding or reliance on internet-scale textual world knowledge. Existing evaluation methods emphasize end-task accuracy, overlooking robustness, attribution fidelity, and reasoning under controlled perturbations. We present The Perceptual Observatory, a framework that characterizes MLLMs across verticals such as: (i) simple vision tasks, such as face matching and text-in-vision comprehension capabilities; (ii) local-to-global understanding, encompassing image matching, grid pointing game, and attribute localization,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Face recognition and analysis · Face Recognition and Perception
