A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning

Tianle Chen; Deepti Ghadiyaram

arXiv:2604.03995 · cs.CV · April 7, 2026

TL;DR

This paper systematically investigates how cross-modal typographic attacks can significantly compromise audio-visual large language models, revealing their vulnerabilities and the increased threat posed by coordinated multi-modal perturbations.

Contribution

It introduces Multi-Modal Typography, demonstrating the heightened effectiveness of cross-modal attacks over unimodal ones and highlighting an underexplored security risk in multi-modal reasoning models.

Findings

01

Cross-modal attacks achieve an 83.43% success rate.

02

Single-modality attacks have a 34.93% success rate.

03

Multi-modal typography poses a critical security threat.

Abstract

As audio-visual multi-modal large language models (MLLMs) are increasingly deployed in safety-critical applications, understanding their vulnerabilities is crucial. To this end, we introduce Multi-Modal Typography, a systematic study examining how typographic attacks across multiple modalities adversely influence MLLMs. While prior work focuses narrowly on unimodal attacks, we expose the cross-modal fragility of MLLMs. We analyze the interactions between audio, visual, and text perturbations and reveal that a coordinated multi-modal attack creates a significantly more potent threat than single-modality attacks (attack success rate = 83.43% vs. 34.93%). Our findings across multiple frontier MLLMs, tasks, and common-sense reasoning and content moderation benchmarks establish multi-modal typography as a critical and underexplored attack strategy in multi-modal reasoning. Code and data…
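
The abstract reports attack success rate (ASR) without defining it on this page. The sketch below shows one common way such a metric is computed, under the assumption that an attack "succeeds" when the model's answer on the perturbed input matches the attacker's injected target. The function name `attack_success_rate`, the case-insensitive matching rule, and the toy data are all hypothetical and are not taken from the paper or its code.

```python
from typing import Sequence


def attack_success_rate(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Percentage of samples whose attacked prediction equals the attacker's target.

    Assumption: success means an exact (case-insensitive, whitespace-trimmed)
    match with the injected target. The paper may define success differently.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("need at least one sample")
    hits = sum(
        p.strip().lower() == t.strip().lower()
        for p, t in zip(predictions, targets)
    )
    return 100.0 * hits / len(predictions)


# Toy, made-up data for illustration only; these are not results from the paper.
preds = ["cat", "Dog", "dog ", "bird"]
targets = ["dog", "dog", "dog", "dog"]
asr = attack_success_rate(preds, targets)
print(f"ASR = {asr:.2f}%")
```

Under this assumed definition, comparing a single-modality run against a coordinated multi-modal run amounts to calling the function once per setting on the same set of samples.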
