Cross-Modal Attention Analysis and Optimization in Vision-Language Models: A Study on Visual Reliability

Lijie Zhou

arXiv:2604.17217 · cs.CV · April 21, 2026


TL;DR

This study analyzes how much vision-language models rely on visual versus textual information. It proposes an adversarial evaluation framework and a set of fine-tuning optimizations that cut the average accuracy drop under conflicting text from 27.5% to 9.8%.

Contribution

It introduces an adversarial evaluation framework and an optimized training strategy that reduces reliance on textual shortcuts in vision-language models.

Findings

01

The optimized model reduces average accuracy degradation (Drop) under conflicting text from 27.5% to 9.8%.

02

Attention visualization shows increased focus on visual features.

03

Model maintains 97% accuracy on normal data.
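
The relative improvement implied by the first finding can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `relative_reduction` is hypothetical, and the two inputs are the reported Drop values taken at face value.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


# Reported average Drop values (in percent) for the baseline and the
# optimized model; no other data is assumed.
baseline_drop = 27.5
optimized_drop = 9.8

reduction = relative_reduction(baseline_drop, optimized_drop)
print(f"{reduction:.1%}")  # prints 64.4%
```

In other words, the optimized model removes roughly two thirds of the accuracy lost to conflicting text.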

Abstract

Vision-Language Models (VLMs) achieve strong cross-modal performance, yet recent evidence suggests they over-rely on textual descriptions while under-utilizing visual evidence, a phenomenon termed "text shortcut learning." We propose an adversarial evaluation framework that quantifies this cross-modal dependency by measuring accuracy degradation (Drop) when semantically conflicting text is paired with unchanged images. Four adversarial strategies (shape_swap, color_swap, position_swap, and random_text) are applied to a controlled geometric-shapes dataset (n = 1,000). We compare three configurations: Baseline CLIP (ViT-B/32), LoRA fine-tuning, and LoRA Optimized (integrating Hard Negative Mining, Label Smoothing, layer-wise learning rates, Cosine Restarts, curriculum learning, and data augmentation). The optimized model reduces average Drop from 27.5% to 9.8% (64.4%…
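
The Drop measurement described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the `Sample` fields, the caption template, the attribute vocabularies, the shape-classification task, and the two stand-in predictors are all hypothetical, and no real CLIP model is involved. Only the four strategy names and the idea of "clean accuracy minus accuracy under conflicting text" come from the abstract.

```python
import random
from dataclasses import dataclass, replace
from typing import Callable, Sequence

# Hypothetical attribute vocabularies; the paper's actual values are not given here.
SHAPES = ("circle", "square", "triangle")
COLORS = ("red", "green", "blue")
POSITIONS = ("top", "bottom", "left", "right")
STRATEGIES = ("shape_swap", "color_swap", "position_swap", "random_text")


@dataclass(frozen=True)
class Sample:
    """Stand-in for an image: its ground-truth attributes."""
    shape: str
    color: str
    position: str


def caption(s: Sample) -> str:
    return f"a {s.color} {s.shape} at the {s.position}"


def other(value: str, choices: Sequence[str], rng: random.Random) -> str:
    """Pick any choice that differs from the true value."""
    return rng.choice([c for c in choices if c != value])


def adversarial_caption(s: Sample, strategy: str, rng: random.Random) -> str:
    """Build text that conflicts with the (unchanged) image."""
    if strategy == "shape_swap":
        return caption(replace(s, shape=other(s.shape, SHAPES, rng)))
    if strategy == "color_swap":
        return caption(replace(s, color=other(s.color, COLORS, rng)))
    if strategy == "position_swap":
        return caption(replace(s, position=other(s.position, POSITIONS, rng)))
    if strategy == "random_text":
        return rng.choice(["a photo of a dog", "quarterly sales report"])
    raise ValueError(f"unknown strategy: {strategy}")


# A predictor maps (image stand-in, text) to a predicted shape label.
Predictor = Callable[[Sample, str], str]


def accuracy(predict: Predictor, samples: Sequence[Sample], texts: Sequence[str]) -> float:
    correct = sum(predict(s, t) == s.shape for s, t in zip(samples, texts))
    return correct / len(samples)


def drop(predict: Predictor, samples: Sequence[Sample], strategy: str, seed: int = 0) -> float:
    """Clean accuracy minus accuracy when the text conflicts with the image."""
    rng = random.Random(seed)
    clean = accuracy(predict, samples, [caption(s) for s in samples])
    adv = accuracy(predict, samples, [adversarial_caption(s, strategy, rng) for s in samples])
    return clean - adv


def text_only(sample: Sample, text: str) -> str:
    """Extreme 'text shortcut' predictor: ignores the image entirely."""
    words = text.split()
    return next((shape for shape in SHAPES if shape in words), "unknown")


def vision_only(sample: Sample, text: str) -> str:
    """Idealized predictor that relies only on the image."""
    return sample.shape


_rng = random.Random(42)
samples = [
    Sample(_rng.choice(SHAPES), _rng.choice(COLORS), _rng.choice(POSITIONS))
    for _ in range(100)
]

if __name__ == "__main__":
    for strategy in STRATEGIES:
        print(strategy, drop(text_only, samples, strategy), drop(vision_only, samples, strategy))
```

The two toy predictors mark the extremes: one that reads only the text loses all of its accuracy under shape_swap, while one that looks only at the image loses none. A real VLM would presumably fall somewhere in between, which is the quantity the reported Drop figures summarize.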
