GuardAlign: Test-time Safety Alignment in Multimodal Large Language Models

Xingyu Zhu; Beier Zhu; Junfeng Fang; Shuo Wang; Yin Zhang; Xiang Wang; Xiangnan He

arXiv:2602.24027·cs.CV·March 2, 2026

Open Access · 3 Reviews

TL;DR

GuardAlign is a training-free safety framework for multimodal large language models that improves detection of unsafe content and maintains safety signals during output generation, enhancing safety without sacrificing performance.

Contribution

It introduces OT-enhanced safety detection and cross-modal attentive calibration, providing an effective, training-free safety mechanism for multimodal models.
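To make the first component concrete, here is a minimal Python sketch of scoring an image by an entropic optimal-transport (Sinkhorn) distance between its patch embeddings and a set of unsafe-concept embeddings, which is how the abstract describes the detection step at a high level. Everything specific is an assumption for illustration: the cosine cost, the uniform marginals, the `eps` regularizer, the function names, and the toy 2-D vectors standing in for CLIP features. It is not the paper's formulation.

```python
import math


def cosine_cost(patches, concepts):
    """Cost matrix C[i][j] = 1 - cosine similarity of patch i and concept j."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    cost = []
    for p in patches:
        row = []
        for c in concepts:
            dot = sum(a * b for a, b in zip(p, c))
            row.append(1.0 - dot / (norm(p) * norm(c)))
        cost.append(row)
    return cost


def sinkhorn_distance(cost, eps=0.1, n_iters=200):
    """Entropic OT cost between uniform marginals (assumed), via Sinkhorn."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass over image patches (assumption)
    b = [1.0 / m] * m  # uniform mass over unsafe concepts (assumption)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    dist = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return dist, plan


# Toy 2-D stand-ins for embeddings (hypothetical data, not from the paper).
unsafe_concepts = [[1.0, 0.0], [0.8, 0.6]]
risky_patches = [[0.9, 0.1], [0.7, 0.7], [1.0, 0.2]]
benign_patches = [[-0.2, 1.0], [-1.0, 0.3], [0.0, 1.0]]

risky_score, risky_plan = sinkhorn_distance(cosine_cost(risky_patches, unsafe_concepts))
benign_score, _ = sinkhorn_distance(cosine_cost(benign_patches, unsafe_concepts))
print(f"risky={risky_score:.3f} benign={benign_score:.3f}")
```

A lower distance means the patches sit closer to the unsafe concepts. A real detector would compare this score against a threshold chosen on validation data, and the transport plan shows how patch mass is matched to each concept.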

Findings

01

Reduces unsafe responses by up to 39% on SPA-VL

02

Improves VQAv2 accuracy from 78.51% to 79.21%

03

Maintains model utility while enhancing safety

Abstract

Large vision-language models (LVLMs) have achieved remarkable progress in vision-language reasoning tasks, yet ensuring their safety remains a critical challenge. Recent input-side defenses detect unsafe images with CLIP and prepend safety prefixes to prompts, but they still suffer from inaccurate detection in complex scenes and unstable safety signals during decoding. To address these issues, we propose GuardAlign, a training-free defense framework that integrates two strategies. First, OT-enhanced safety detection leverages optimal transport to measure distribution distances between image patches and unsafe semantics, enabling accurate identification of malicious regions without additional computational cost. Second, cross-modal attentive calibration strengthens the influence of safety prefixes by adaptively reallocating attention across layers, ensuring that safety signals remain…
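The second component, as the abstract describes it, reallocates attention toward the safety prefix. One simple way to picture this is sketched below in Python: multiply the attention weights on the prefix positions by a gain, then renormalize the row so it still sums to 1. The function name, the fixed per-layer gains, and the toy attention rows are all hypothetical; the paper's adaptive rule is not given in this excerpt and is not reproduced here.

```python
def boost_prefix_attention(attn_row, prefix_positions, gain=2.0):
    """Scale attention on prefix tokens by `gain`, then renormalize to sum to 1."""
    prefix = set(prefix_positions)
    scaled = [w * gain if i in prefix else w for i, w in enumerate(attn_row)]
    total = sum(scaled)
    return [w / total for w in scaled]


# One post-softmax attention row per layer for a single query token
# (toy numbers). Positions 0 and 1 hold the safety prefix (assumption).
prefix_positions = [0, 1]
layer_rows = [
    [0.30, 0.20, 0.25, 0.25],
    [0.15, 0.15, 0.35, 0.35],
    [0.10, 0.10, 0.40, 0.40],
]
# Hypothetical schedule: stronger correction where the prefix has faded most.
layer_gains = [1.0, 1.5, 2.0]

calibrated = [
    boost_prefix_attention(row, prefix_positions, gain)
    for row, gain in zip(layer_rows, layer_gains)
]

for before, after in zip(layer_rows, calibrated):
    mass_before = sum(before[i] for i in prefix_positions)
    mass_after = sum(after[i] for i in prefix_positions)
    print(f"prefix mass {mass_before:.3f} -> {mass_after:.3f}")
```

Renormalizing keeps each row a valid probability distribution, so the extra weight on the prefix is taken proportionally from the remaining tokens rather than added on top.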

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

+ GuardAlign operates at inference time and requires no fine-tuning, which is attractive for rapid deployment on large deployed MLLMs.
+ The coupling of fine-grained OT-based patch scoring with attention-level calibration for safety prefixes is an intuitive and novel pairing: detect & sanitize visual evidence, then ensure the LLM heeds the safety prefix.

Weaknesses

- The method is evaluated on many benchmarks but primarily in a black-box or benchmarked adversary setting. An adaptive attacker that crafts images to both avoid OT detection and trigger unsafe generations (e.g., by distributing harmful signals over many patches or embedding signals in texture) is not evaluated. GuardAlign’s resilience to adaptive/strong adversaries is unclear.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The paper is original in combining optimal transport-based detection with attention calibration for inference-time safety alignment. The technical quality is solid, with rigorous theoretical analysis and comprehensive evaluations across models and datasets. Clarity is high: both intuition and formulation are clearly articulated, and the experimental design is systematic.

Weaknesses

The method, while efficient, introduces several hyperparameters (e.g., τ, γ) that are not fully analyzed for stability or generalizability. Evaluation is limited to vision–language reasoning; applicability to other modalities remains untested. The detection component depends on CLIP backbones, which could inherit existing biases.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- Training-free efficiency: GuardAlign operates entirely at inference time without requiring additional data or fine-tuning, making it highly practical and resource-efficient.
- Comprehensive experimental validation: The paper provides thorough evaluations across multiple safety benchmarks and utility tasks, including detailed ablation studies and efficiency analyses.
- Low inference overhead: Compared to existing inference-time defenses like ETA, GuardAlign achieves better safety with moder…

Weaknesses

- Utility improvement: While the paper reports that GuardAlign avoids performance degradation and even boosts utility (e.g., VQAv2 accuracy improves from 78.51% to 79.21%), the underlying mechanism is not sufficiently analyzed. It remains unclear why masking unsafe patches or calibrating attention would enhance general capabilities; this warrants further theoretical or empirical justification.
- Limited model scale evaluation: Experiments are confined to MLLMs up to 13B parameters (e.g., LLaVA…

Taxonomy

Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning