Edge Reliability Gap in Vision-Language Models: Quantifying Failure Modes of Compressed VLMs Under Visual Corruption
Mehmet Kaan Erol

TL;DR
This paper investigates whether compressed vision-language models fail differently, not merely more often, than larger models under visual corruption. Experiments on 4,000 samples from VQAv2 and COCO Captions reveal distinct failure modes and calibration issues.
Contribution
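It introduces a three-category error taxonomy (Object Blindness, Semantic Drift, Prior Bias) as a diagnostic framework for failure modes in compressed VLMs, compares Qwen2.5-VL-7B (4-bit NF4) with SmolVLM2-500M (FP16) on failure modes, confidence calibration, and compositional reasoning, and releases a reproducible pipeline for safety auditing.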
Findings
Compact models exhibit a 12.5-percentage-point larger negation collapse on COCO.
Semantic Drift is the dominant failure mode for the quantised model (Qwen2.5-VL-7B) on both VQAv2 and COCO.
The fully reproducible pipeline is released for safety auditing.
Abstract
The rapid compression of large vision-language models (VLMs) for edge deployment raises an underexplored question: do compact models fail differently, not merely more often? This study compares a 7-billion-parameter quantised VLM (Qwen2.5-VL-7B, 4-bit NF4) against a 500-million-parameter FP16 model (SmolVLM2-500M) across 4,000 samples from VQAv2 and COCO Captions. A three-category error taxonomy (A: Object Blindness, B: Semantic Drift, C: Prior Bias) is applied as a diagnostic framework. A text-only GPT-4o judge reveals Semantic Drift (B) as the dominant failure mode on VQAv2 and on COCO for Qwen, with a mixed Object Blindness / Semantic Drift profile for SmolVLM2 on COCO; Prior Bias (C) is present on VQAv2 but absent on COCO for both models. Confidence calibration is measured via Expected Calibration Error (ECE) using geometric mean token probability, compositional reasoning is probed with…
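The calibration measurement named in the abstract can be illustrated with a minimal sketch. It assumes that per-token log-probabilities are available for each generated answer, that confidence is the exponential of their mean (the geometric mean of the token probabilities), and that ECE uses ten equal-width bins. The bin count, the binning scheme, and the function names are assumptions made for illustration, not details taken from the paper.

```python
import math
from typing import Sequence


def geometric_mean_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric mean of token probabilities: exp(mean(log p_i))."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: equal-width bins, count not given in the source
) -> float:
    """ECE = sum over bins of (bin size / N) * |bin accuracy - bin mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one sample")

    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins

    for c, y in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hit_sum[idx] += 1.0 if y else 0.0
        count[idx] += 1

    ece = 0.0
    for k in range(n_bins):
        if count[k] == 0:
            continue  # empty bins contribute nothing
        accuracy = hit_sum[k] / count[k]
        mean_conf = conf_sum[k] / count[k]
        ece += (count[k] / n) * abs(accuracy - mean_conf)
    return ece
```

A well-calibrated model scores near zero. For example, two answers each given confidence 0.9, of which only one is correct, yield an ECE of about 0.4 under this sketch.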
