CAVALRY-V: A Large-Scale Generator Framework for Adversarial Attacks on Video MLLMs
Jiaming Zhang, Rui Hu, Qing Guo, Wei Yang Bryan Lim

TL;DR
CAVALRY-V is a novel, efficient framework for generating adversarial attacks on Video Multimodal Large Language Models, significantly degrading their performance across various benchmarks and models.
Contribution
It introduces a dual-objective loss and a two-stage generator for effective, scalable adversarial attacks on V-MLLMs, enhancing attack transferability and temporal coherence.
Findings
Achieves a 22.8% average improvement in attack effectiveness over baseline attacks on both commercial and open-source models.
Significantly outperforms existing attack methods on video understanding benchmarks.
Also transfers to image understanding tasks, where attack effectiveness improves by 34.4% on average.
Abstract
Video Multimodal Large Language Models (V-MLLMs) have shown impressive capabilities in temporal reasoning and cross-modal understanding, yet their vulnerability to adversarial attacks remains underexplored due to unique challenges: complex cross-modal reasoning mechanisms, temporal dependencies, and computational constraints. We present CAVALRY-V (Cross-modal Language-Vision Adversarial Yielding for Videos), a novel framework that directly targets the critical interface between visual perception and language generation in V-MLLMs. Our approach introduces two key innovations: (1) a dual-objective semantic-visual loss function that simultaneously disrupts the model's text generation logits and visual representations to undermine cross-modal integration, and (2) a computationally efficient two-stage generator framework that combines large-scale pre-training for cross-model transferability…
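The dual-objective idea in the abstract can be illustrated with a toy calculation. The sketch below is not the authors' loss: the function names, the use of cross-entropy for the text term, cosine similarity for the visual term, and the `weight` parameter are all assumptions chosen only to show how a text-logit objective and a visual-representation objective might be combined into one value for an attacker to minimize.

```python
import math

def softmax_cross_entropy(logits, target_index):
    """Cross-entropy for one token position, computed stably via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (epsilon avoids 0/0)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)

def dual_objective_loss(adv_logits, target_tokens, clean_features,
                        adv_features, weight=1.0):
    """Hypothetical combined objective; lower is better for the attacker.

    Text term (assumed): negative mean cross-entropy of the correct tokens
    under the adversarial input, so minimizing it pushes the model away
    from the correct answer.
    Visual term (assumed): cosine similarity between clean and adversarial
    visual features, so minimizing it pushes the representations apart.
    """
    ce = sum(softmax_cross_entropy(l, t)
             for l, t in zip(adv_logits, target_tokens)) / len(target_tokens)
    sim = cosine_similarity(clean_features, adv_features)
    return -ce + weight * sim

# Toy usage: same logits, but the second perturbation also shifts the features.
logits = [[0.0, 0.0]]
unchanged = dual_objective_loss(logits, [0], [1.0, 0.0], [1.0, 0.0])
shifted = dual_objective_loss(logits, [0], [1.0, 0.0], [0.0, 1.0])
print(shifted < unchanged)
```

In this toy setting the perturbation that moves the visual features away from the clean ones scores lower, which is the direction the attacker prefers. In a real attack both terms would come from a model's forward pass and be optimized by gradient descent through a perturbation generator.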
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
