CAVALRY-V: A Large-Scale Generator Framework for Adversarial Attacks on Video MLLMs

Jiaming Zhang; Rui Hu; Qing Guo; Wei Yang Bryan Lim

arXiv:2507.00817 · cs.CV · July 2, 2025

Open Access

TL;DR

CAVALRY-V is a novel, efficient framework for generating adversarial attacks on Video Multimodal Large Language Models, significantly degrading their performance across various benchmarks and models.

Contribution

The paper introduces a dual-objective semantic-visual loss and a two-stage generator framework for effective, scalable adversarial attacks on V-MLLMs, improving attack transferability and temporal coherence.

Findings

01

Achieves a 22.8% average improvement in attack effectiveness over the best baseline attacks on both commercial and open-source models.

02

Significantly outperforms existing attack methods on video understanding benchmarks.

03

Extends to image understanding models, where attack effectiveness improves by 34.4% on average.

Abstract

Video Multimodal Large Language Models (V-MLLMs) have shown impressive capabilities in temporal reasoning and cross-modal understanding, yet their vulnerability to adversarial attacks remains underexplored due to unique challenges: complex cross-modal reasoning mechanisms, temporal dependencies, and computational constraints. We present CAVALRY-V (Cross-modal Language-Vision Adversarial Yielding for Videos), a novel framework that directly targets the critical interface between visual perception and language generation in V-MLLMs. Our approach introduces two key innovations: (1) a dual-objective semantic-visual loss function that simultaneously disrupts the model's text generation logits and visual representations to undermine cross-modal integration, and (2) a computationally efficient two-stage generator framework that combines large-scale pre-training for cross-model transferability…
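The dual-objective idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's formulation, which the excerpt above does not give: the function names, the use of cross-entropy for the text-logit term, the use of cosine distance for the visual term, and the weight `lam` are all assumptions chosen for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def semantic_term(adv_logits, clean_token_ids):
    """Mean cross-entropy of the clean answer tokens under the adversarial logits.

    A larger value means the perturbed input pushes the model further
    away from the answer it gave on the clean input.
    """
    losses = [-log_softmax(row)[tok] for row, tok in zip(adv_logits, clean_token_ids)]
    return sum(losses) / len(losses)


def visual_term(clean_feat, adv_feat):
    """Cosine distance (1 - cosine similarity) between visual feature vectors."""
    dot = sum(a * b for a, b in zip(clean_feat, adv_feat))
    norm = math.sqrt(sum(a * a for a in clean_feat)) * math.sqrt(sum(b * b for b in adv_feat))
    return 1.0 - dot / norm


def dual_objective(adv_logits, clean_token_ids, clean_feat, adv_feat, lam=1.0):
    """Quantity an attacker would maximize; lam is a hypothetical balancing weight."""
    return semantic_term(adv_logits, clean_token_ids) + lam * visual_term(clean_feat, adv_feat)


# Toy check: a perturbation that flips the top token and rotates the
# visual feature should score higher than no perturbation at all.
unperturbed = dual_objective([[4.0, 0.0, 0.0]], [0], [1.0, 0.0], [1.0, 0.0])
perturbed = dual_objective([[0.0, 4.0, 0.0]], [0], [1.0, 0.0], [0.0, 1.0])
assert perturbed > unperturbed
```

In a real attack the logits and features would come from a victim or surrogate V-MLLM, and a generator network would be trained by gradient ascent on such an objective under a perturbation budget; the sketch omits both.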

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis