Penny Wise, Pixel Foolish: Bypassing Price Constraints in Multimodal Agents via Visual Adversarial Perturbations

Jiachen Qian; Zhaolu Kang

arXiv:2604.16515·cs.CV·April 21, 2026

TL;DR

This paper shows how imperceptible visual perturbations can deceive multimodal agents in price-constrained financial tasks. It proposes an attack method, PriceBlind, and evaluates defenses intended to improve robustness.

Contribution

The paper introduces PriceBlind, a white-box adversarial attack for controlled screenshot-based evaluation that exploits the modality gap in CLIP-based encoders.

Findings

01 PriceBlind achieves around 80% attack success rate (ASR) in white-box evaluation on E-ShopBench.

02 Transfer attacks reach 35-41% ASR across multiple models.

03 Robust encoders and defenses reduce ASR but also affect accuracy.
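
The findings above are reported as attack success rates. As a minimal sketch of how such a rate is usually computed, assuming success means the agent violated the price constraint in an attacked episode (the page does not give the paper's exact criterion), the helper below takes one boolean per episode. The function name `attack_success_rate` and the sample outcomes are hypothetical and are not data from the paper.

```python
def attack_success_rate(outcomes):
    """Fraction of attacked episodes flagged as successful attacks.

    `outcomes` holds one boolean per attacked episode; what counts as
    "success" is an assumption here (the agent breaks the price constraint).
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one attacked episode")
    return sum(1 for hit in outcomes if hit) / len(outcomes)


# Made-up outcomes for ten attacked episodes (illustration only).
sample = [True] * 8 + [False] * 2
print(f"ASR = {attack_success_rate(sample):.0%}")  # prints "ASR = 80%"
```

Comparing this rate with and without a defense, alongside clean-task accuracy, is one simple way to see the trade-off mentioned in finding 03.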

Abstract

The rapid proliferation of Multimodal Large Language Models (MLLMs) has enabled mobile agents to execute high-stakes financial transactions, but their adversarial robustness remains underexplored. We identify Visual Dominance Hallucination (VDH), where imperceptible visual cues can override textual price evidence in screenshot-based, price-constrained settings and lead agents to irrational decisions. We propose PriceBlind, a stealthy white-box adversarial attack framework for controlled screenshot-based evaluation. PriceBlind exploits the modality gap in CLIP-based encoders via a Semantic-Decoupling Loss that aligns the image embedding with low-cost, value-associated anchors while preserving pixel-level fidelity. On E-ShopBench, PriceBlind achieves around 80% ASR in white-box evaluation; under a simplified single-turn coordinate-selection protocol, Ensemble-DI-FGSM transfers with…
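
The abstract describes the attack only at a high level. The general idea is to pull an image embedding toward an anchor embedding while keeping every pixel within a small budget, and the sketch below illustrates that idea in plain Python. It is not the paper's Semantic-Decoupling Loss. The linear `encode` function stands in for a CLIP image encoder, the objective is a plain dot product with a single anchor vector, and the values of `eps`, `step`, and `iters` are illustrative; all of these are assumptions rather than details from the paper.

```python
import math
import random


def encode(weights, pixels):
    """Toy linear 'image encoder': one dot product per embedding dimension."""
    return [sum(w * p for w, p in zip(row, pixels)) for row in weights]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def alignment_grad(weights, anchor):
    """Gradient of <encode(x), anchor> w.r.t. x; constant for a linear encoder."""
    n_pixels = len(weights[0])
    return [sum(weights[k][j] * anchor[k] for k in range(len(anchor)))
            for j in range(n_pixels)]


def linf_attack(weights, pixels, anchor, eps=8 / 255, step=2 / 255, iters=10):
    """Iterative signed-gradient ascent on the alignment score, with an
    L-infinity budget `eps` around the original pixels and a [0, 1] range."""
    grad = alignment_grad(weights, anchor)
    adv = list(pixels)
    for _ in range(iters):
        for j, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            moved = adv[j] + step * sign
            # Project back into the eps-ball, then into the valid pixel range.
            moved = min(max(moved, pixels[j] - eps), pixels[j] + eps)
            adv[j] = min(max(moved, 0.0), 1.0)
    return adv


# Hypothetical toy data: a 16-"pixel" image and a 4-dimensional embedding.
random.seed(0)
weights = [[random.uniform(-1, 1) for _ in range(16)] for _ in range(4)]
pixels = [random.uniform(0.2, 0.8) for _ in range(16)]
anchor = [random.uniform(-1, 1) for _ in range(4)]

adv = linf_attack(weights, pixels, anchor)
print("cosine before:", round(cosine(encode(weights, pixels), anchor), 4))
print("cosine after: ", round(cosine(encode(weights, adv), anchor), 4))
print("max pixel change:", max(abs(a - p) for a, p in zip(adv, pixels)))
```

With a real encoder the gradient would come from automatic differentiation and would change at every step. The projection step is what keeps the perturbation small enough to be hard to see.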
