Pragma-VL: Towards a Pragmatic Arbitration of Safety and Helpfulness in MLLMs
Ming Wen, Kun Yang, Xin Chen, Jingyu Zhang, Dingding Han, Shiwen Cui, Yuedong Xu

TL;DR
Pragma-VL is an end-to-end alignment method for multimodal large language models that pragmatically balances safety and helpfulness, improving safety benchmarks while maintaining reasoning and knowledge capabilities.
Contribution
It introduces a safety arbitration framework that combines risk-aware visual perception with a theoretically guaranteed reward model to improve the safety-helpfulness trade-off.
Findings
Outperforms baselines by 5-20% on safety benchmarks.
Maintains strong performance in mathematics and knowledge reasoning.
Effectively balances safety and helpfulness in multimodal models.
Abstract
Multimodal Large Language Models (MLLMs) pose critical safety challenges, as they are susceptible not only to adversarial attacks such as jailbreaking but also to inadvertently generating harmful content for benign users. While internal safety alignment via Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) is a primary mitigation strategy, current methods often face a safety-utility trade-off: they either refuse benign queries out of excessive caution or overlook latent risks in cross-modal interactions. To resolve this, we introduce Pragma-VL, an end-to-end alignment algorithm that enables MLLMs to pragmatically arbitrate between safety and helpfulness. First, we enhance visual risk perception with a novel cold-start SFT stage. This is achieved by applying risk-aware clustering to the visual encoder and using an interleaved dataset of risk descriptions and high-quality data.…
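The abstract does not spell out the risk-aware clustering objective applied to the visual encoder. As a rough illustration only, the sketch below computes one generic clustering signal: the mean squared distance of each visual embedding to the centroid of its risk level. The function name `risk_centroid_loss`, the integer risk labels, and the choice of a centroid-pull loss are all assumptions, not Pragma-VL's actual formulation.

```python
def risk_centroid_loss(embeddings, risk_levels):
    """Mean squared distance of each embedding to its risk-level centroid.

    Hypothetical illustration: `embeddings` are lists of floats (e.g. pooled
    visual-encoder outputs) and `risk_levels` are integer labels such as
    0 (safe) to 3 (severe). A lower value means tighter per-risk clusters.
    """
    if len(embeddings) != len(risk_levels):
        raise ValueError("embeddings and risk_levels must have the same length")
    if not embeddings:
        raise ValueError("need at least one embedding")

    # Group embeddings by their (assumed) risk label.
    groups = {}
    for emb, level in zip(embeddings, risk_levels):
        groups.setdefault(level, []).append(emb)

    # One centroid per risk level: the coordinate-wise mean of its members.
    centroids = {}
    for level, members in groups.items():
        dim = len(members[0])
        centroids[level] = [
            sum(m[d] for m in members) / len(members) for d in range(dim)
        ]

    # Average squared Euclidean distance to the assigned centroid.
    total = 0.0
    for emb, level in zip(embeddings, risk_levels):
        centroid = centroids[level]
        total += sum((e - c) ** 2 for e, c in zip(emb, centroid))
    return total / len(embeddings)
```

A practical objective would usually pair this pull term with a term that keeps the centroids of different risk levels apart; otherwise the encoder could satisfy it by collapsing all embeddings to a single point.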
Peer Reviews
Decision: ICLR 2026 Poster
- **Methodological Completeness**: Pragma-VL is a well-designed, end-to-end system. It correctly identifies that policy alignment (RL) cannot fix fundamental perceptual issues. Thus, the proposed "cold-start" SFT stage (which addresses visual risk perception before connecting it to language cognition) is methodologically sound and rigorous.
- **In-Depth Analysis of the Reward Model**: The paper's exploration of reward model (RM) architectures (single-objective vs. sequential vs. parallel) is a hi…
- **Annotation Quality and Bias Risk**: Overreliance on AI annotation without human verification. The core "pragmatic arbitration" capability depends entirely on the PragmaSafe dataset, whose context-weight labels are generated by GPT-4o. This heavy reliance on a single AI model introduces two problems: (a) Bias propagation: systematic biases in GPT-4o may be propagated to and solidified in Pragma-VL. (b) Lack of a gold standard: the paper trusts the AI annotations but lacks a human agreement study. Us…
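The reviews mention per-example context-weight labels and a parallel reward-model architecture but do not give the combining formula. One plausible reading, sketched below under that assumption, scores safety and helpfulness with separate heads and blends them with a context weight in [0, 1]. The name `arbitrated_reward` and the linear (convex) blend are hypothetical; the paper may combine the rewards differently.

```python
def arbitrated_reward(r_safety: float, r_help: float, context_weight: float) -> float:
    """Blend two independently scored rewards with a context weight.

    Hypothetical sketch: `r_safety` and `r_help` come from two parallel
    reward heads, and `context_weight` states how much the context calls
    for caution (1.0 = fully safety-driven, 0.0 = fully helpfulness-driven).
    """
    if not 0.0 <= context_weight <= 1.0:
        raise ValueError("context_weight must lie in [0, 1]")
    # Convex combination: the context decides how much each objective counts.
    return context_weight * r_safety + (1.0 - context_weight) * r_help
```

For example, with `r_safety = -1.0` and `r_help = 1.0`, a high-risk context (weight 0.9) yields about -0.8, while a benign context (weight 0.1) yields about 0.8, so the same response is penalized or rewarded depending on its context.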
1. The paper introduces a novel data labeling and augmentation method through the PragmaSafe approach, enhancing context-dependent preference labels.
2. The Pragma-VL framework provides a dynamic, context-aware solution to the critical trade-off between safety and helpfulness in MLLMs, which is highly important to the community.
3. The authors conducted comprehensive experiments to validate the performance of Pragma-VL across various benchmarks.
4. The framework retains strong performance on gener…
The work remains limited by comparatively narrow model validation, a heuristic cold-start design, reliance on GPT-4o annotations, the lack of comparison to newer multi-objective baselines, and homogeneous benchmarking, among other issues. Future work should address these through broader empirical validation, human-calibrated evaluation, and open data release.
1. This paper is motivated by an important research problem: enabling MLLMs to dynamically arbitrate the helpfulness-safety trade-off. This is critical because focusing on either safety or helpfulness alone is inadequate.
2. This paper improves the visual encoder's ability to perceive the severity of safety risks, which is largely ignored when training existing vision encoders.
1. Many intuitions, explanations, and motivations are missing from the formulation of the contextual data augmentation (Equation 1). For example, why is the adjustment magnitude sampled from a Gaussian distribution? In addition, it is unclear why a larger difference in variance should imply a larger adjustment magnitude.
2. A clear formulation of the parallel rewards is missing. The authors propose reward models with parallel rewards, along with other variants such as sequential and single. Howe…
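Equation 1 is not reproduced on this page. The sketch below is a hypothetical reading of the behavior the reviewer describes: an adjustment to a context weight is drawn from a zero-mean Gaussian whose standard deviation grows with the gap between the variances of two score samples, so a larger variance gap tends to produce a larger adjustment magnitude. The function name `augment_context_weight`, the `scale` parameter, the use of population variance, and the clipping to [0, 1] are all assumptions.

```python
import random
import statistics


def augment_context_weight(weight, scores_a, scores_b, scale=0.5, rng=None):
    """Return a perturbed copy of a context weight (hypothetical sketch).

    The adjustment delta is drawn from N(0, sigma^2) with
    sigma = scale * |Var(scores_a) - Var(scores_b)|, so |delta| follows a
    half-normal distribution whose spread grows with the variance gap.
    The result is clipped to [0, 1].
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    rng = rng if rng is not None else random.Random()

    # Gap between the population variances of the two score samples.
    gap = abs(statistics.pvariance(scores_a) - statistics.pvariance(scores_b))
    sigma = scale * gap
    if sigma == 0:
        # No variance gap: leave the weight unchanged.
        return weight

    delta = rng.gauss(0.0, sigma)
    # Clip so the augmented weight stays a valid mixing coefficient.
    return min(1.0, max(0.0, weight + delta))
```

Passing a seeded `random.Random` makes the augmentation reproducible. Because `gauss` returns `mu + z * sigma`, the same seed with a larger variance gap gives a proportionally larger adjustment (before clipping).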
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications · Advanced Graph Neural Networks
