Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium

Qingxin Xiao; Peilin Zhao; Yangyang Zhao; Lingwei Dang; Qingyao Wu

arXiv:2605.10676·cs.CV·May 12, 2026

TL;DR

This paper introduces ACE, a training-free method that balances visual and linguistic information in multimodal models by perturbing visual context, reducing hallucinations and improving trustworthiness during decoding.

Contribution

The paper proposes ACE, a novel adversarial framework that dynamically balances vision and language in multimodal models without additional training.

Findings

01. ACE improves model trustworthiness with negligible inference overhead.

02. ACE effectively suppresses hallucinations caused by equilibrium imbalance.

03. Experiments show ACE enhances decoding accuracy and reliability.

Abstract

During MLLM decoding, attention often concentrates abnormally on irrelevant image tokens. Existing work dismisses these tokens as noise and forcibly redirects attention toward key image information. We argue instead that these tokens are critical carriers of visual and narrative logic, and that such coercive corrections exacerbate the vision-language imbalance. Adopting a "decoding-as-game" perspective, we show that hallucinations stem from an equilibrium imbalance between linguistic priors and visual information. We propose Adversarial Counter-Commonsense Equilibrium (ACE), a training-free framework that perturbs the visual context with counter-commonsense patches. Leveraging the observation that authentic visual features remain stable under perturbation while hallucinations fluctuate, ACE implements a dynamic game decoding strategy. This approach precisely suppresses perturbation-sensitive…
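
The abstract is cut off before it gives ACE's actual decoding rule, so the following is only a minimal sketch of the general idea it describes: tokens whose likelihood shifts when the visual context is perturbed are down-weighted, while stable tokens keep their score. The function name `stability_adjusted_probs`, the absolute log-probability shift penalty, the weight `alpha`, and the toy logits are all assumptions made for illustration, not the paper's formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stability_adjusted_probs(logits_clean, logits_perturbed, alpha=1.0):
    """Hypothetical rule: penalize tokens that are sensitive to perturbation.

    logits_clean:     next-token logits given the original image
    logits_perturbed: next-token logits given the perturbed image
    alpha:            assumed penalty weight (0 recovers the clean distribution)
    """
    p_clean = softmax(logits_clean)
    p_pert = softmax(logits_perturbed)
    adjusted = []
    for pc, pp in zip(p_clean, p_pert):
        # Size of the log-probability change caused by the perturbation.
        shift = abs(math.log(pc) - math.log(pp))
        adjusted.append(math.log(pc) - alpha * shift)
    return softmax(adjusted)


# Toy example with made-up numbers: "frisbee" leads on the clean image
# but collapses once the visual context is perturbed.
vocab = ["dog", "frisbee", "grass"]
logits_clean = [2.0, 2.2, 0.5]
logits_perturbed = [2.0, 0.2, 0.5]

probs = stability_adjusted_probs(logits_clean, logits_perturbed, alpha=1.0)
best = vocab[max(range(len(vocab)), key=lambda i: probs[i])]
print(best, [round(p, 3) for p in probs])
```

With these toy logits, greedy decoding on the clean distribution alone would pick "frisbee", whereas the adjusted distribution prefers "dog" because its probability moves less under perturbation. How ACE constructs the counter-commonsense patches and how it sets up the game between linguistic priors and visual information is not recoverable from the truncated abstract, so neither is modeled here.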
