Pragma-VL: Towards a Pragmatic Arbitration of Safety and Helpfulness in MLLMs

Ming Wen; Kun Yang; Xin Chen; Jingyu Zhang; Dingding Han; Shiwen Cui; Yuedong Xu

arXiv:2603.13292·cs.LG·March 17, 2026

TL;DR

Pragma-VL is an end-to-end alignment method for multimodal large language models that pragmatically balances safety and helpfulness, improving results on safety benchmarks while maintaining reasoning and knowledge capabilities.

Contribution

It introduces a novel safety arbitration framework with risk-aware visual perception and a theoretically-guaranteed reward model for better safety-helpfulness trade-offs.
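
This page does not give the reward formulation, so the following is only an illustrative sketch of what arbitrating between safety and helpfulness could look like. It assumes a simple convex combination of two scalar scores under a context-dependent weight; the function names, the scores, and the weighting rule are all hypothetical and are not taken from the paper.

```python
def arbitrate(safety_score: float, helpfulness_score: float, safety_weight: float) -> float:
    """Blend two reward scores with a context weight in [0, 1].

    Assumption: a convex combination. The paper's actual reward model may differ.
    """
    if not 0.0 <= safety_weight <= 1.0:
        raise ValueError("safety_weight must lie in [0, 1]")
    return safety_weight * safety_score + (1.0 - safety_weight) * helpfulness_score


# Hypothetical scores for two candidate responses to the same query.
REFUSAL = {"safety": 0.9, "helpfulness": 0.2}
ANSWER = {"safety": 0.4, "helpfulness": 0.9}


def preferred(safety_weight: float) -> str:
    """Return which candidate scores higher under the given context weight."""
    refusal_reward = arbitrate(REFUSAL["safety"], REFUSAL["helpfulness"], safety_weight)
    answer_reward = arbitrate(ANSWER["safety"], ANSWER["helpfulness"], safety_weight)
    return "refusal" if refusal_reward > answer_reward else "answer"


if __name__ == "__main__":
    # A risky context (high safety weight) versus a benign one (low safety weight).
    print(preferred(0.8))
    print(preferred(0.2))
```

Under these made-up scores, a high safety weight favours the refusal and a low one favours the detailed answer, which is the kind of context-dependent preference flip the framework is described as targeting.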

Findings

01

Outperforms baselines by 5-20% on safety benchmarks.

02

Maintains strong performance in mathematics and knowledge reasoning.

03

Effectively balances safety and helpfulness in multimodal models.

Abstract

Multimodal Large Language Models (MLLMs) pose critical safety challenges, as they are susceptible not only to adversarial attacks such as jailbreaking but also to inadvertently generating harmful content for benign users. While internal safety alignment via Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) is a primary mitigation strategy, current methods often face a safety-utility trade-off: they either refuse benign queries out of excessive caution or overlook latent risks in cross-modal interactions. To resolve this, we introduce Pragma-VL, an end-to-end alignment algorithm that enables MLLMs to pragmatically arbitrate between safety and helpfulness. First, we enhance visual risk perception with a novel cold-start SFT stage. This is achieved by applying risk-aware clustering to the visual encoder and using an interleaved dataset of risk descriptions and high-quality data.…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- **Methodological Completeness**: Pragma-VL is a well-designed, end-to-end system. It correctly identifies that policy alignment (RL) cannot fix fundamental perceptual issues. Thus, the proposed "cold-start" SFT stage (first addressing visual risk perception before connecting to language cognition) is methodologically sound and rigorous.
- **In-depth Analysis of the Reward Model**: The paper's exploration of reward model (RM) architectures (single objective vs. sequential vs. parallel) is a hi

Weaknesses

- **Annotation Quality and Bias Risk**: Over-reliance on AI annotation without human verification. The core "pragmatic arbitration" capability depends entirely on the PragmaSafe dataset, whose context weight labels are generated by GPT-4o. This heavy reliance on one AI model introduces two problems: (a) Bias Propagation: Systematic biases from GPT-4o may be propagated and solidified in Pragma-VL. (b) Lack of a Gold Standard: The paper trusts the AI annotations but lacks a human agreement study. Us

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The paper introduces a novel data labeling and augmentation method through the PragmaSafe approach, enhancing context-dependent preference labels.
2. The Pragma-VL framework provides a dynamic, context-aware solution to the critical trade-off between safety and helpfulness in MLLMs, which is really important to the community.
3. The authors conducted comprehensive experiments to validate the performance of Pragma-VL across various benchmarks.
4. The framework retains strong performance on gener

Weaknesses

The work remains limited by comparatively narrow model validation, heuristic cold-start design, reliance on GPT-4o annotations, lack of comparison to newer multi-objective baselines, homogeneous benchmarking, among a few others. These are issues that future work should address through broader empirical validation, human-calibrated evaluation, and open data release.

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. This paper is motivated by an important research problem: enabling MLLMs to dynamically arbitrate the helpfulness-safety trade-off. This is critical, as focusing on either safety or helpfulness alone is inadequate.
2. This paper improves the ability of the visual encoder to perceive safety severity, which is largely ignored when training existing vision encoders.

Weaknesses

1. Many intuitions, explanations, and motivations are missing from the formulation of the contextual data augmentation (Equation 1). For example, why do we need to sample the adjustment magnitude from a Gaussian distribution? In addition, it is unclear why a larger difference in variance should suggest a larger adjustment magnitude.
2. A clear formulation of parallel rewards is missing. The authors propose reward models with parallel rewards, along with other variants such as sequential and single. Howe

Code & Models

Datasets

SII-fleeeecer/PragmaSafe-Beavertails
dataset · 3 dl

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications · Advanced Graph Neural Networks