Discern Truth from Falsehood: Reducing Over-Refusal via Contrastive Refinement

Yuxiao Lu; Lin Xu; Yang Sun; Wenjun Li; Jie Shi

arXiv:2603.03323·cs.CL·March 5, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces DCR, a contrastive refinement method that improves large language models' ability to distinguish truly toxic prompts from superficially toxic ones, reducing over-refusal without sacrificing safety or general capabilities.

Contribution

The paper proposes a novel alignment stage, DCR, that enhances toxicity discernment in LLMs through contrastive learning, addressing over-refusal issues more effectively than prior methods.

Findings

01. DCR significantly reduces over-refusal in LLMs.

02. DCR maintains safety and helpfulness of models.

03. Empirical results show improved toxicity discrimination.

Abstract

Large language models (LLMs) aligned for safety often suffer from over-refusal, the tendency to reject seemingly toxic or benign prompts by misclassifying them as toxic. This behavior undermines models' helpfulness and restricts usability in sensitive or nuanced contexts. While prior work has proposed mitigation strategies such as data augmentation and activation steering, these approaches often face a trade-off: reducing over-refusal typically degrades the model's ability to reject genuinely harmful content. We argue that this issue arises from the ambiguous influence of toxic and seemingly toxic prompts on the model's learning dynamics. To address it, we introduce a preceding alignment stage, DCR: Discernment via Contrastive Refinement. Both theoretically and empirically, we demonstrate that contrastive refinement improves an LLM's capacity to distinguish truly toxic prompts from…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The proposed method is a simple pre-alignment contrastive phase, so existing SFT pipelines remain unchanged.
2. The gradient/NTK similarity analysis provides mathematical foundations for understanding refusal co-movement.
3. The paper shows mathematically that reducing representation similarity limits how refusals spread.

Weaknesses

1. Using XSTest in the contrastive stage and again in evaluation may bias results toward an in-distribution advantage.
2. Justifications for several design choices are missing:
   (i) The methodology for determining which layers receive contrastive loss for each model architecture is not explained.
   (ii) The selection of circle loss over alternatives (e.g., InfoNCE, NT-Xent) lacks comparative analysis.
   (iii) No analysis of how varying toxic/seemingly-toxic sampling ratios affects model performance and

Reviewer 02 · Rating 8 · Confidence 3

Strengths

1. **Clear theoretical grounding.** The paper (Sections 5 and 7.3) provides a formal link between activation similarity and empirical neural tangent kernel similarity to show how the method lowers gradient coupling.
2. **The method is conceptually simple and doesn’t require architectural modification.** It directly targets the mechanism behind over-refusal (gradient coupling) rather than surface behaviors, which is underexplored.
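The link between activation similarity and gradient coupling that this reviewer highlights can be illustrated with a toy case. The sketch below is not the paper's derivation: it assumes a hypothetical scalar "refusal score" f(x) = w · h(x) with a linear head on top of fixed activations h(x), and all activation vectors are made up. In that restricted setting the gradient with respect to the head is h(x) itself, so the empirical NTK between two prompts equals the inner product of their activations, and one gradient step on a toxic prompt shifts the score of another prompt in proportion to that kernel value.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def head_gradient(hidden):
    # For f(x) = w . h(x), the gradient of f with respect to w is h(x).
    return list(hidden)

def head_entk(h_a, h_b):
    # Empirical NTK restricted to the linear head: <grad f(a), grad f(b)>.
    return dot(head_gradient(h_a), head_gradient(h_b))

def sgd_shift(w, h_train, h_other, lr):
    # Take one gradient-ascent step on f(h_train) and report how much
    # the score of a different prompt, f(h_other), moves as a result.
    before = dot(w, h_other)
    w_new = [wi + lr * gi for wi, gi in zip(w, head_gradient(h_train))]
    return dot(w_new, h_other) - before

# Hypothetical activations (assumptions for illustration only).
h_toxic = [1.0, 0.9, 0.1]
h_seeming_before = [0.9, 1.0, 0.2]  # entangled with the toxic prompt
h_seeming_after = [0.1, 0.2, 1.0]   # pushed apart, e.g. by a contrastive stage

for name, h in [("before", h_seeming_before), ("after", h_seeming_after)]:
    print(name,
          "cosine:", cosine(h_toxic, h),
          "eNTK:", head_entk(h_toxic, h),
          "shift:", sgd_shift([0.0, 0.0, 0.0], h_toxic, h, lr=0.1))
```

In this toy setting the score shift equals the learning rate times the kernel value, which is one way to read "refusal co-movement": the less similar the activations, the less a refusal update on a toxic prompt leaks onto a seemingly toxic one. For a full network the kernel sums over all parameters, so this identity is only an approximation of the general picture.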

Weaknesses

1. **Rejection probability calculation can be biased.** As described in Appendix A.5, the refusal probability aggregates mass over a fixed list of refusal strings, but models can refuse with more nuanced paraphrases. Reporting calibration such as precision/recall would further support the robustness of this metric, and a learned refusal classifier might be useful too.
2. **Formatting.** Citations that should be parenthetical are written as “Author, Year” instead of
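Reviewer 02's concern about the refusal metric can be made concrete with a small sketch. It assumes a hypothetical prefix list (`REFUSAL_PREFIXES`) and hand-labeled toy responses; the actual strings and aggregation in Appendix A.5 are not reproduced on this page, so this only illustrates the general idea of summing probability mass over fixed refusal strings and checking a string matcher against human labels.

```python
# Hypothetical refusal prefixes; NOT the list from the paper's Appendix A.5.
REFUSAL_PREFIXES = ("i'm sorry", "i cannot", "i can't", "as an ai")

def is_refusal(text, prefixes=REFUSAL_PREFIXES):
    # Case-insensitive prefix match against the fixed list.
    return text.strip().lower().startswith(prefixes)

def refusal_probability(continuations):
    # continuations: list of (text, probability) pairs for one prompt.
    # Sums the probability mass of continuations matched as refusals.
    return sum(p for text, p in continuations if is_refusal(text))

def precision_recall(predicted, actual):
    # Precision and recall of the string matcher against human labels.
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy responses with hypothetical human labels (True = the model refused).
responses = [
    ("I'm sorry, but I can't help with that.", True),
    ("I cannot assist with this request.", True),
    ("That is not something I am able to discuss.", True),  # paraphrase
    ("Sure, here is how to kill a Python process.", False),
]

predicted = [is_refusal(text) for text, _ in responses]
actual = [label for _, label in responses]
print(precision_recall(predicted, actual))
```

Here the paraphrased refusal slips past the prefix list, so recall drops while precision stays high. That is the kind of calibration gap the reviewer asks the authors to quantify.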

Reviewer 03 · Rating 6 · Confidence 4

Strengths

It provides a novel insight into over-refusal in LLMs and a reasonable account of its cause. The proposed approach is well motivated by the observed high correlation between the refusal rates of truly toxic and seemingly toxic prompts. It is compared with several baseline approaches, and the experiments show promising performance.

Weaknesses

The evaluation uses relatively small models (up to 8B), and scalability is not shown. The observations in Figures 1 and 3 are interesting and form the core of the proposed approach. However, they are provided only for a small (1.5B) model. To support the paper's claims, it is recommended to add the same observations for mid-sized (7–8B) models. The strength of the contrastive learning (such as the number of epochs, learning rate, etc.) may affect th

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)