Cross-Modal Attention Analysis and Optimization in Vision-Language Models: A Study on Visual Reliability
Lijie Zhou

TL;DR
This study analyzes and improves how vision-language models rely on visual versus textual information, proposing an adversarial framework and optimization techniques that significantly enhance visual feature utilization.
Contribution
It introduces an adversarial evaluation framework and an optimized training strategy that reduces reliance on textual shortcuts in vision-language models.
Findings
The optimized model reduces average accuracy degradation (Drop) under adversarial text from 27.5% to 9.8%.
Attention visualization shows increased focus on visual features.
The model maintains 97% accuracy on normal (non-adversarial) data.
Abstract
Vision-Language Models (VLMs) achieve strong cross-modal performance, yet recent evidence suggests they over-rely on textual descriptions while under-utilizing visual evidence, a phenomenon termed "text shortcut learning." We propose an adversarial evaluation framework that quantifies this cross-modal dependency by measuring accuracy degradation (Drop) when semantically conflicting text is paired with unchanged images. Four adversarial strategies (shape_swap, color_swap, position_swap, and random_text) are applied to a controlled geometric-shapes dataset (). We compare three configurations: Baseline CLIP (ViT-B/32), LoRA fine-tuning, and LoRA Optimized (integrating Hard Negative Mining, Label Smoothing, layer-wise learning rates, Cosine Restarts, curriculum learning, and data augmentation). The optimized model reduces average Drop from 27.5% to 9.8% (64.4%…
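The Drop metric described in the abstract can be illustrated with a short Python sketch. It assumes Drop is the absolute loss in accuracy, in percentage points, between clean and adversarial text paired with the same images; the abstract does not spell out the exact formula, so this definition is an assumption. The function names `accuracy`, `drop`, and `relative_reduction` are hypothetical, and the toy labels and predictions are illustrative only, not data from the paper.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def drop(clean_acc, adversarial_acc):
    """Assumed definition: absolute accuracy loss in percentage points."""
    return 100.0 * (clean_acc - adversarial_acc)


def relative_reduction(drop_before, drop_after):
    """Percentage by which Drop shrinks from one model to another."""
    if drop_before == 0:
        raise ValueError("drop_before must be non-zero")
    return 100.0 * (drop_before - drop_after) / drop_before


# Toy example (illustrative only): same images, clean vs. conflicting text.
labels = ["circle", "square", "triangle", "circle"]
clean_preds = ["circle", "square", "triangle", "circle"]
adv_preds = ["circle", "circle", "triangle", "square"]

toy_drop = drop(accuracy(clean_preds, labels), accuracy(adv_preds, labels))

# Average Drop values reported in the abstract: 27.5% before, 9.8% after.
reported_reduction = relative_reduction(27.5, 9.8)

print(f"toy Drop: {toy_drop:.1f} points")
print(f"relative reduction in reported Drop: {reported_reduction:.1f}%")
```

Under this reading, the relative reduction computed from the two reported averages rounds to 64.4%, which agrees with the figure quoted in the abstract.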
