Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models
Eyal Hadad, Mordechai Guri

TL;DR
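This paper presents a dual-layer side-channel attack on local vision-language models. Because dynamic high-resolution preprocessing makes the workload depend on the input image, an unprivileged attacker can infer the image's geometry from execution time and can distinguish visually dense from sparse content by profiling Last-Level Cache contention.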
Contribution
The paper introduces a dual-layer attack framework that exploits execution-time and cache-contention signals in VLMs that use dynamic preprocessing, and it discusses mitigation strategies and their trade-offs.
Findings
Attack reliably fingerprints input geometry using execution-time variations.
Cache contention profiling distinguishes between dense and sparse visual content.
Mitigation via constant-work padding incurs significant performance overhead.
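To make the last finding concrete, here is a minimal sketch of what constant-work padding could look like, assuming the preprocessor emits a list of image tiles and that the worst case is a fixed tile budget. The names `MAX_TILES`, `pad_to_constant_work`, and `relative_overhead` are hypothetical and are not taken from the paper. The sketch only illustrates why padding every input up to the worst case costs extra compute.

```python
from typing import List, Sequence

# Assumed worst-case number of tiles per image (hypothetical value).
MAX_TILES = 5


def pad_to_constant_work(tiles: Sequence[object], dummy: object,
                         max_tiles: int = MAX_TILES) -> List[object]:
    """Return a list of exactly max_tiles entries by appending dummy tiles.

    Every input then triggers the same number of encoder passes, so the
    tile count no longer varies with the image shape.
    """
    if len(tiles) > max_tiles:
        raise ValueError("input exceeds the assumed worst-case tile budget")
    return list(tiles) + [dummy] * (max_tiles - len(tiles))


def relative_overhead(real_tiles: int, max_tiles: int = MAX_TILES) -> float:
    """Extra tiles processed, as a fraction of the tiles actually needed."""
    if real_tiles <= 0:
        raise ValueError("real_tiles must be positive")
    return (max_tiles - real_tiles) / real_tiles


if __name__ == "__main__":
    padded = pad_to_constant_work(["t0", "t1"], dummy="blank")
    print(len(padded), relative_overhead(2))
```

Under these assumptions, an image that needs only two tiles is processed as five, which is 150% extra tile work. This sketch equalizes only the tile count; it does not address content-dependent cache behavior.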
Abstract
On-device Vision-Language Models (VLMs) promise data privacy via local execution. However, we show that the architectural shift toward Dynamic High-Resolution preprocessing (e.g., AnyRes) introduces an inherent algorithmic side-channel. Unlike static models, dynamic preprocessing decomposes images into a variable number of patches based on their aspect ratio, creating input-dependent workloads. We demonstrate a dual-layer attack framework against local VLMs. In Tier 1, an attacker can exploit significant execution-time variations using standard unprivileged OS metrics to reliably fingerprint the input's geometry. In Tier 2, by profiling Last-Level Cache (LLC) contention, the attacker can resolve semantic ambiguity within identical geometries, distinguishing between visually dense (e.g., medical X-rays) and sparse (e.g., text documents) content. By evaluating…
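The dependence of patch count on aspect ratio can be illustrated with a simplified sketch of AnyRes-style grid selection. This is an assumption-laden illustration, not the paper's code or any specific model's implementation: the tile size `TILE`, the candidate list `CANDIDATE_GRIDS`, and the functions `select_grid` and `tile_count` are hypothetical. The selection rule (maximize the effective resolution after an aspect-preserving fit, then minimize wasted canvas area) follows a common design for dynamic high-resolution preprocessing.

```python
from typing import List, Optional, Tuple

TILE = 336  # assumed tile edge in pixels (hypothetical; varies by model)

# Assumed candidate grids, given as (columns, rows) of tiles.
CANDIDATE_GRIDS: List[Tuple[int, int]] = [(1, 2), (2, 1), (2, 2), (1, 3), (3, 1)]


def select_grid(width: int, height: int) -> Tuple[int, int]:
    """Pick the grid that keeps the most image detail with the least padding."""
    best: Optional[Tuple[int, int]] = None
    best_key: Optional[Tuple[int, int]] = None
    for cols, rows in CANDIDATE_GRIDS:
        canvas_w, canvas_h = cols * TILE, rows * TILE
        # Aspect-preserving fit of the image into this canvas.
        scale = min(canvas_w / width, canvas_h / height)
        scaled_w, scaled_h = int(width * scale), int(height * scale)
        # Detail retained cannot exceed the original pixel count.
        effective = min(scaled_w * scaled_h, width * height)
        wasted = canvas_w * canvas_h - effective
        key = (effective, -wasted)  # more detail first, then less waste
        if best_key is None or key > best_key:
            best, best_key = (cols, rows), key
    assert best is not None
    return best


def tile_count(width: int, height: int, include_thumbnail: bool = True) -> int:
    """Number of tiles sent to the vision encoder for an image of this shape."""
    cols, rows = select_grid(width, height)
    return cols * rows + (1 if include_thumbnail else 0)


if __name__ == "__main__":
    for shape in [(1000, 1000), (1920, 480), (480, 1920)]:
        print(shape, select_grid(*shape), tile_count(*shape))
```

With these assumed settings, a square image selects a 2x2 grid while a 4:1 panorama selects a 3x1 grid, so the number of encoder passes changes with the image's shape. That shape-dependent workload is the kind of variation the Tier 1 timing channel relies on.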
