GRAIT: Gradient-Driven Refusal-Aware Instruction Tuning for Effective Hallucination Mitigation

Runchuan Zhu; Zinco Jiang; Jiang Wu; Zhipeng Ma; Jiahe Song; Fengshuo Bai; Dahua Lin; Lijun Wu; Conghui He

arXiv:2502.05911 · cs.CL · February 11, 2025

TL;DR

GRAIT is an instruction tuning framework that uses gradient-driven techniques to improve LLMs' ability to refuse questions beyond their knowledge, reducing hallucinations while maintaining helpfulness.

Contribution

This paper introduces GRAIT, a gradient-based refusal-aware instruction tuning method that balances hallucination mitigation and helpfulness in LLMs.

Findings

01

GRAIT significantly outperforms existing methods in refusal accuracy.

02

GRAIT reduces hallucinations effectively in question answering tasks.

03

The method maintains the helpfulness of responses while refusing unknown questions.

Abstract

Refusal-Aware Instruction Tuning (RAIT) aims to enhance Large Language Models (LLMs) by improving their ability to refuse to answer questions beyond their knowledge, thereby reducing hallucinations and improving reliability. Effective RAIT must address two key challenges: first, it must effectively reject unknown questions to minimize hallucinations; second, it must avoid over-refusal, so that questions the model can answer correctly are not rejected and the helpfulness of its outputs is maintained. In this paper, we address both challenges by deriving insights from a gradient-based perspective and propose the Gradient-driven Refusal-Aware Instruction Tuning framework (GRAIT), which (1) employs gradient-driven sample selection to effectively minimize hallucinations and (2) introduces an adaptive weighting mechanism during fine-tuning to reduce the risk of over-refusal, achieving…
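The pipeline the abstract describes can be illustrated with a minimal sketch. The abstract does not state GRAIT's actual selection criterion or weighting formula, so the rules below are hypothetical stand-ins, not the authors' method. Samples are relabeled in the usual RAIT way, keeping the gold answer when the model already answers correctly and substituting a refusal string otherwise. Refusal samples with the largest gradient norm are then kept, and each kept refusal sample is down-weighted by how strongly its gradient points against the mean gradient of answerable samples, which serves as a proxy for over-refusal risk. All function names and the `REFUSAL` string are illustrative. Per-sample gradients are assumed to be precomputed as flattened vectors.

```python
import numpy as np

REFUSAL = "I don't know."  # hypothetical refusal target string


def build_rait_targets(samples):
    """Common RAIT relabeling (assumed here): keep the gold answer if the
    model's prediction matches it, otherwise train the model to refuse.
    `samples` is a list of (question, gold_answer, model_prediction)."""
    out = []
    for question, gold, pred in samples:
        known = pred.strip().lower() == gold.strip().lower()
        out.append((question, gold if known else REFUSAL))
    return out


def cosine(a: np.ndarray, b: np.ndarray, eps: float = 1e-12) -> float:
    """Cosine similarity between two flattened gradient vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def select_refusal_samples(idk_grads: np.ndarray, k: int) -> np.ndarray:
    """Hypothetical selection rule: indices of the k refusal samples with
    the largest gradient norm (those expected to move the model most)."""
    norms = np.linalg.norm(idk_grads, axis=1)
    return np.argsort(-norms)[:k]


def adaptive_weights(idk_grads: np.ndarray, ans_grads: np.ndarray) -> np.ndarray:
    """Hypothetical weighting rule: map the cosine between each refusal
    gradient and the mean answerable-sample gradient from [-1, 1] to a
    weight in [0, 1], so conflicting refusal samples count less."""
    ans_mean = ans_grads.mean(axis=0)
    cos = np.array([cosine(g, ans_mean) for g in idk_grads])
    return (1.0 + cos) / 2.0


def weighted_loss(losses: np.ndarray, weights: np.ndarray) -> float:
    """Weighted average of per-sample losses."""
    return float((weights * losses).sum() / weights.sum())


# Toy usage with made-up 2-D "gradients" (not data from the paper).
ans_grads = np.array([[1.0, 0.0], [0.8, 0.2]])
idk_grads = np.array([[2.0, 0.0], [-1.5, 0.0], [0.1, 0.1]])
chosen = select_refusal_samples(idk_grads, k=2)
weights = adaptive_weights(idk_grads[chosen], ans_grads)
```

In this toy example the second selected refusal sample has a gradient almost opposite to the answerable samples, so it receives a weight near zero. This is the kind of conflict an over-refusal-aware weighting would be expected to dampen. A real implementation would obtain per-sample gradients from the model itself, for example over a subset of parameters.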
