Spatial Colour Mixing Illusions as a Perception Stress Test for Vision-Language Models
Nicoleta-Nina Basoc, Adrian Cosma, Emilian Radoi

TL;DR
This paper investigates the perceptual weaknesses of vision-language models (VLMs) under structured colour distortions. It shows that accuracy drops sharply as distortion increases and proposes perception-aware preprocessing to improve robustness.
Contribution
It introduces a framework of eight spatial colour mixing variants, evaluates how the accuracy of nine VLMs from three model families degrades on four datasets, and suggests perception-aware preprocessing as a practical improvement strategy.
Findings
VLM accuracy declines sharply as colour distortion increases
Scaling the language model does not reliably improve robustness
Perception-aware preprocessing recovers a significant share of the lost performance
Abstract
Vision-language models (VLMs) achieve strong benchmark results, yet can exhibit systematic perceptual weaknesses: structured, large changes to pixel values can cause confident yet nonsensical predictions, even when the underlying scene remains easily recognizable to humans. We study this gap using Spatial Colour Mixing, a programmatic family of colour distortions that overlays structured patterns (in both RGB and Ostwald colour systems) onto natural images. We introduce a framework of eight spatial colour mixing variants and evaluate nine VLMs across three model families on four datasets. Across models and datasets, accuracy degrades sharply with increasing distortion, and scaling the language model does not reliably mitigate the failure. In a human study with 61 participants on an animal recognition dataset, humans substantially outperform VLMs under the same distortions. Finally, we…
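To make the idea of overlaying a structured colour pattern concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the checkerboard layout, the function name `checkerboard_mix`, and the parameters `tint`, `cell`, and `strength` are all assumptions chosen for illustration, and the sketch covers RGB blending only (the Ostwald variants are not represented).

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each channel in 0..255
Image = List[List[Pixel]]      # rows of pixels

def checkerboard_mix(image: Image, tint: Pixel,
                     cell: int = 2, strength: float = 1.0) -> Image:
    """Blend `tint` into every other cell of a checkerboard grid.

    Hypothetical illustration of a spatial colour mixing distortion:
    cells of size `cell` x `cell` alternate between the untouched image
    and a blend of the image with `tint`. `strength` in [0, 1] controls
    how strongly the tint replaces the original pixel.
    """
    if cell < 1:
        raise ValueError("cell must be >= 1")
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")

    out: Image = []
    for y, row in enumerate(image):
        new_row: List[Pixel] = []
        for x, px in enumerate(row):
            # Odd cells of the checkerboard receive the tint.
            if ((x // cell) + (y // cell)) % 2 == 1:
                px = tuple(
                    round((1.0 - strength) * c + strength * t)
                    for c, t in zip(px, tint)
                )
            new_row.append(px)
        out.append(new_row)
    return out

# Usage: distort a tiny uniform grey image with a pure red pattern.
grey = [[(100, 100, 100)] * 4 for _ in range(4)]
distorted = checkerboard_mix(grey, tint=(255, 0, 0), cell=1, strength=1.0)
```

In this sketch, increasing `strength` or changing `cell` alters how much of the original pixel content survives, which is one plausible way to sweep a distortion level when measuring how accuracy degrades.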
Taxonomy
Topics: Multimodal Machine Learning Applications · Categorization, perception, and language · Generative Adversarial Networks and Image Synthesis
