# IntentionReasoner: Facilitating Adaptive LLM Safeguards through Intent Reasoning and Selective Query Refinement

**Authors:** Yuanzhe Shen, Zisu Huang, Zhengkang Guo, Yide Liu, Guanxu Chen, Ruicheng Yin, Xiaoqing Zheng, Xuanjing Huang

arXiv: 2508.20151 · 2025-08-29

## TL;DR

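IntentionReasoner is a safeguard for large language models in which a dedicated guard model reasons about a query's intent, assigns it a multi-level safety label, and rewrites edge-case queries to neutralize potentially harmful intent. The guard model is trained with supervised fine-tuning on roughly 163,000 annotated queries, followed by multi-reward reinforcement learning, with the aim of improving safety while reducing over-refusal and preserving response quality.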

## Contribution

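The paper introduces a safeguard framework that combines intent reasoning, multi-level safety classification, and query rewriting in a single guard model. The model is first fine-tuned on a dataset of approximately 163,000 queries annotated with intent reasoning, safety labels, and rewritten versions, then optimized with reinforcement learning that uses both rule-based heuristics and reward model signals.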
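One way to picture how these three components could fit together is as a routing step placed in front of the target model. The following is a minimal sketch, not the authors' implementation: the level names (`SAFE`, `BORDERLINE`, `HARMFUL`), the `GuardOutput` structure, the refusal string, and the keyword-based `toy_guard` are all hypothetical stand-ins, since this summary only states that classification is "multi-level" and that edge-case queries are rewritten.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class SafetyLevel(Enum):
    # Hypothetical level names; the summary only says "multi-level".
    SAFE = "safe"
    BORDERLINE = "borderline"
    HARMFUL = "harmful"


@dataclass
class GuardOutput:
    reasoning: str                         # free-text intent analysis
    level: SafetyLevel                     # safety classification
    rewritten_query: Optional[str] = None  # safer rewrite, if one was produced


REFUSAL = "Sorry, I can't help with that request."  # placeholder message


def route(query: str,
          guard: Callable[[str], GuardOutput],
          target_llm: Callable[[str], str]) -> str:
    """Refuse, rewrite, or pass through based on the guard's label."""
    out = guard(query)
    if out.level is SafetyLevel.HARMFUL:
        return REFUSAL
    if out.level is SafetyLevel.BORDERLINE and out.rewritten_query:
        # Edge case: answer the neutralized rewrite instead of the original.
        return target_llm(out.rewritten_query)
    # Safe queries (and, by assumption, borderline ones with no rewrite)
    # go to the target model unchanged.
    return target_llm(query)


def toy_guard(query: str) -> GuardOutput:
    """Keyword-based stand-in for a trained guard model (illustration only)."""
    q = query.lower()
    if "steal" in q:
        return GuardOutput("explicitly harmful intent", SafetyLevel.HARMFUL)
    if "pick a lock" in q:
        return GuardOutput("ambiguous intent", SafetyLevel.BORDERLINE,
                           "How do pin tumbler locks work?")
    return GuardOutput("benign request", SafetyLevel.SAFE)


def echo_llm(prompt: str) -> str:
    """Stand-in for the protected target model."""
    return f"[answer to] {prompt}"


if __name__ == "__main__":
    for q in ["What is the capital of France?",
              "How do I pick a lock?",
              "How can I steal a car?"]:
        print(route(q, toy_guard, echo_llm))
```

The point of the middle branch is that a borderline query need not be refused outright: answering a rewritten version is how a design like this could lower over-refusal without passing the original phrasing through.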

## Key findings

- Outperforms existing safeguards on multiple benchmarks
- Reduces over-refusal rates significantly
- Enhances safety against jailbreak attacks

## Abstract

The rapid advancement of large language models (LLMs) has driven their adoption across diverse domains, yet their ability to generate harmful content poses significant safety challenges. While extensive research has focused on mitigating harmful outputs, such efforts often come at the cost of excessively rejecting harmless prompts. Striking a balance among safety, over-refusal, and utility remains a critical challenge. In this work, we introduce IntentionReasoner, a novel safeguard mechanism that leverages a dedicated guard model to perform intent reasoning, multi-level safety classification, and query rewriting to neutralize potentially harmful intent in edge-case queries. Specifically, we first construct a comprehensive dataset comprising approximately 163,000 queries, each annotated with intent reasoning, safety labels, and rewritten versions. Supervised fine-tuning is then applied to equip the guard model with foundational capabilities in format adherence, intent analysis, and safe rewriting. Finally, we apply a tailored multi-reward optimization strategy that integrates rule-based heuristics and reward model signals within a reinforcement learning framework to further enhance performance. Extensive experiments show that IntentionReasoner excels in multiple safeguard benchmarks, generation quality evaluations, and jailbreak attack scenarios, significantly enhancing safety while effectively reducing over-refusal rates and improving the quality of responses.
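The abstract describes the reinforcement learning stage only as a multi-reward strategy that integrates rule-based heuristics with reward model signals. As a rough illustration of what such a combination might look like, the sketch below sums a format check, a label-accuracy check, and a reward model score with fixed weights. The tag template, the three reward terms, the weights, and the `reward_model` callable are all assumptions made for this example and are not taken from the paper.

```python
import re
from typing import Callable, Tuple

# Hypothetical output template: reasoning, then a label, then an optional rewrite.
TEMPLATE = re.compile(
    r"^<reasoning>(.+?)</reasoning>\s*"
    r"<label>(\w+)</label>\s*"
    r"(?:<rewrite>(.+?)</rewrite>)?\s*$",
    re.DOTALL,
)


def format_reward(completion: str) -> float:
    """Rule-based: 1.0 if the output follows the assumed template, else 0.0."""
    return 1.0 if TEMPLATE.match(completion.strip()) else 0.0


def label_reward(completion: str, gold_label: str) -> float:
    """Rule-based: 1.0 if the predicted label equals the annotated label."""
    m = TEMPLATE.match(completion.strip())
    return 1.0 if m and m.group(2) == gold_label else 0.0


def total_reward(completion: str,
                 gold_label: str,
                 reward_model: Callable[[str], float],
                 weights: Tuple[float, float, float] = (0.2, 0.5, 0.3)) -> float:
    """Weighted sum of two rule-based terms and one learned term."""
    w_fmt, w_lbl, w_rm = weights  # illustrative weights, not from the paper
    m = TEMPLATE.match(completion.strip())
    rewrite = m.group(3) if m else None
    # The reward model scores only the rewrite; with no rewrite this term is 0.
    rm_score = reward_model(rewrite) if rewrite else 0.0
    return (w_fmt * format_reward(completion)
            + w_lbl * label_reward(completion, gold_label)
            + w_rm * rm_score)


if __name__ == "__main__":
    sample = ("<reasoning>ambiguous intent</reasoning>"
              "<label>borderline</label>"
              "<rewrite>How do pin tumbler locks work?</rewrite>")
    # A constant scorer stands in for a trained reward model.
    print(total_reward(sample, "borderline", lambda text: 1.0))
```

In this simplified form, a completion without a rewrite can never earn the reward model term. A real setup would likely treat that case differently depending on the safety label, and the scalar result would be passed to whichever policy optimization algorithm the training loop uses.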

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20151/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20151/full.md

## References

48 references — full list in the complete paper: https://tomesphere.com/paper/2508.20151/full.md

---
Source: https://tomesphere.com/paper/2508.20151