SMSP: A Plug-and-Play Strategy of Multi-Scale Perception for MLLMs to Perceive Visual Illusions
Jinzhe Tu, Ruilei Guo, Zihan Guo, Junxiao Yang, Shiyao Cui, Minlie Huang

TL;DR
This paper identifies high-frequency attention bias in Multimodal Large Language Models (MLLMs) as a key cause of their vulnerability to hidden-pattern visual illusions, and proposes the Strategy of Multi-Scale Perception (SMSP), which substantially improves their accuracy on illusion images.
Contribution
The paper introduces SMSP, a plug-and-play framework that aligns MLLMs' perception with human visual strategies by suppressing distracting high-frequency backgrounds.
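The idea of suppressing high-frequency backgrounds can be illustrated with a small sketch. This is not the authors' SMSP pipeline, whose details are not given in this summary; it is a minimal illustration that assumes plain average pooling as the low-pass step. The function names (`average_pool`, `multi_scale_views`, `neighbour_contrast`) and the synthetic image are hypothetical.

```python
import numpy as np


def average_pool(image: np.ndarray, factor: int) -> np.ndarray:
    """Downsample a 2-D grayscale image by averaging factor x factor blocks.

    Averaging acts as a crude low-pass filter: texture that varies faster
    than the block size is smoothed out, while coarse shapes survive.
    """
    h, w = image.shape
    h_crop, w_crop = h - h % factor, w - w % factor  # drop ragged edges
    cropped = image[:h_crop, :w_crop]
    blocks = cropped.reshape(h_crop // factor, factor, w_crop // factor, factor)
    return blocks.mean(axis=(1, 3))


def multi_scale_views(image: np.ndarray, factors=(1, 2, 4, 8)) -> dict:
    """Return the image at several scales (assumed factors, chosen for illustration)."""
    return {f: average_pool(image, f) for f in factors}


def neighbour_contrast(image: np.ndarray) -> float:
    """Mean absolute difference between horizontal neighbours (a rough high-frequency proxy)."""
    return float(np.abs(np.diff(image, axis=1)).mean())


# Synthetic stand-in for an illusion image: a coarse hidden square
# buried under a fine checkerboard texture.
size = 64
hidden = np.zeros((size, size))
hidden[16:48, 16:48] = 1.0
yy, xx = np.indices((size, size))
texture = ((yy + xx) % 2) * 2.0 - 1.0  # alternating -1 / +1
image = hidden + 0.5 * texture

views = multi_scale_views(image)
for factor, view in views.items():
    print(factor, view.shape, round(neighbour_contrast(view), 3))
```

In this toy case the checkerboard cancels inside every 2x2 block, so the downsampled views contain only the hidden square. Real illusion images will not separate this cleanly, and how the scaled views would be passed to an MLLM is left open here.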
Findings
SMSP improves MLLMs' accuracy on illusion images from 13.0% to 84.0%.
High-frequency attention bias causes MLLMs to overlook hidden patterns.
SMSP effectively mitigates the impact of visual illusions on MLLMs.
Abstract
Recent works have shown that Multimodal Large Language Models (MLLMs) are highly vulnerable to hidden-pattern visual illusions, where the hidden content is imperceptible to models but obvious to humans. This deficiency highlights a perceptual misalignment between current MLLMs and humans, and also introduces potential safety concerns. To systematically investigate this failure, we introduce IlluChar, a comprehensive and challenging illusion dataset, and uncover a key underlying mechanism for the models' failure: high-frequency attention bias, where the models are easily distracted by high-frequency background textures in illusion images, causing them to overlook hidden patterns. To address the issue, we propose the Strategy of Multi-Scale Perception (SMSP), a plug-and-play framework that aligns with human visual perceptual strategies. By suppressing distracting high-frequency…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis
