Discern Truth from Falsehood: Reducing Over-Refusal via Contrastive Refinement
Yuxiao Lu, Lin Xu, Yang Sun, Wenjun Li, Jie Shi

TL;DR
This paper introduces DCR, a contrastive refinement method that improves large language models' ability to distinguish truly toxic prompts from superficially toxic ones, reducing over-refusal without sacrificing safety or general capabilities.
Contribution
The paper proposes a novel alignment stage, DCR, that enhances toxicity discernment in LLMs through contrastive learning, addressing over-refusal issues more effectively than prior methods.
Findings
DCR significantly reduces over-refusal in LLMs.
DCR maintains safety and helpfulness of models.
Empirical results show improved toxicity discrimination.
Abstract
Large language models (LLMs) aligned for safety often suffer from over-refusal, the tendency to reject seemingly toxic or benign prompts by misclassifying them as toxic. This behavior undermines models' helpfulness and restricts usability in sensitive or nuanced contexts. While prior work has proposed mitigation strategies such as data augmentation and activation steering, these approaches often face a trade-off: reducing over-refusal typically degrades the model's ability to reject genuinely harmful content. We argue that this issue arises from the ambiguous influence of toxic and seemingly toxic prompts on the model's learning dynamics. To address it, we introduce a preceding alignment stage, DCR: Discernment via Contrastive Refinement. Both theoretically and empirically, we demonstrate that contrastive refinement improves an LLM's capacity to distinguish truly toxic prompts from…
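The reviews below note that the contrastive stage uses circle loss. As a rough illustration only, the sketch below implements the standard circle loss formulation over cosine similarities of hidden states. The pairing scheme (same-group pairs as positives, toxic versus seemingly toxic pairs as negatives), the helper names `pairwise_cosine` and `discernment_loss`, and the `margin` and `gamma` values are all assumptions made for this example, not details confirmed by the paper.

```python
import numpy as np

def _logsumexp(x):
    # Numerically stable log(sum(exp(x))).
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def circle_loss(sim_pos, sim_neg, margin=0.25, gamma=32.0):
    """Standard circle loss over similarity scores (margin/gamma are illustrative)."""
    sim_pos = np.asarray(sim_pos, dtype=float)
    sim_neg = np.asarray(sim_neg, dtype=float)
    # Adaptive weights: pairs far from their optimum contribute more.
    alpha_pos = np.clip(1.0 + margin - sim_pos, 0.0, None)
    alpha_neg = np.clip(sim_neg + margin, 0.0, None)
    logit_pos = -gamma * alpha_pos * (sim_pos - (1.0 - margin))
    logit_neg = gamma * alpha_neg * (sim_neg - margin)
    # log(1 + sum_j exp(logit_neg_j) * sum_i exp(logit_pos_i)), computed stably.
    z = _logsumexp(logit_pos) + _logsumexp(logit_neg)
    return float(np.logaddexp(0.0, z))

def pairwise_cosine(a, b):
    # Cosine similarity between every row of a and every row of b.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def discernment_loss(h_toxic, h_seeming, margin=0.25, gamma=32.0):
    """Hypothetical pairing: within-group pairs are positives, cross-group pairs negatives.

    Assumes each group holds at least two hidden-state vectors.
    """
    pos = []
    for h in (h_toxic, h_seeming):
        sims = pairwise_cosine(h, h)
        pos.append(sims[np.triu_indices(len(h), k=1)])  # skip self-pairs
    sim_pos = np.concatenate(pos)
    sim_neg = pairwise_cosine(h_toxic, h_seeming).ravel()
    return circle_loss(sim_pos, sim_neg, margin, gamma)

# Toy hidden states: two toxic and two seemingly toxic prompts in 2-D.
h_toxic = np.array([[1.0, 0.1], [1.0, -0.1]])
h_separated = np.array([[0.1, 1.0], [-0.1, 1.0]])
h_entangled = np.array([[1.0, 0.2], [1.0, -0.2]])
print(discernment_loss(h_toxic, h_separated))
print(discernment_loss(h_toxic, h_entangled))
```

In this toy setup the loss is smaller when the two groups occupy different directions in representation space, which is the behavior a contrastive refinement stage would reward.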
Peer Reviews
Decision: ICLR 2026 Poster
1. The proposed method is a simple pre-alignment contrastive phase, so existing SFT pipelines remain unchanged.
2. The gradient/NTK similarity analysis provides a mathematical foundation for understanding refusal co-movement.
3. The paper shows mathematically that reducing representation similarity limits how refusals spread.
1. Using XSTest in the contrastive stage and again in evaluation may bias results toward an in-distribution advantage.
2. Justifications for several design choices are missing: (i) The methodology for determining which layers receive the contrastive loss for each model architecture is not explained. (ii) The selection of circle loss over alternatives (e.g., InfoNCE, NT-Xent) lacks comparative analysis. (iii) No analysis of how varying toxic/seemingly-toxic sampling ratios affects model performance and
1. **Clear theoretical grounding.** The paper (Sections 5 and 7.3) provides a formal link between activation similarity and empirical neural tangent kernel similarity to show how the method lowers gradient coupling.
2. **The method is conceptually simple and doesn’t require architectural modification.** It directly targets the mechanism behind over-refusal (gradient coupling) rather than surface behaviors, which is underexplored.
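The link between activation similarity and gradient coupling can be seen in a deliberately simplified setting. The sketch below assumes a refusal logit f(x) = w · φ(x) in which only the linear head `w` is trained and the features φ(x) are frozen; in that case the empirical NTK between two prompts is exactly the inner product of their activations. This is a toy illustration under that assumption, not the paper's derivation, and the function names are hypothetical.

```python
import numpy as np

def entk_linear_head(phi_a, phi_b):
    """Empirical NTK of f(x) = w . phi(x) w.r.t. w only: grad_w f(x) = phi(x)."""
    return float(np.dot(phi_a, phi_b))

def logit_shift_after_step(w, phi_train, phi_other, lr=0.1):
    """Raise the refusal logit of phi_train by one gradient step on w and
    return the resulting change in the refusal logit of phi_other."""
    before = np.dot(w, phi_other)
    w_new = w + lr * phi_train  # gradient of w . phi_train w.r.t. w is phi_train
    after = np.dot(w_new, phi_other)
    return float(after - before)

w = np.zeros(2)
toxic = np.array([1.0, 0.0])            # activation of a truly toxic prompt
seeming_close = np.array([0.9, 0.4359])  # seemingly toxic, similar activation
seeming_far = np.array([0.1, 0.995])     # seemingly toxic, dissimilar activation

for name, phi in [("close", seeming_close), ("far", seeming_far)]:
    shift = logit_shift_after_step(w, toxic, phi)
    print(name, shift, 0.1 * entk_linear_head(toxic, phi))
```

The shift in the untrained prompt's refusal logit equals the learning rate times the kernel value, so pushing the two activations apart shrinks how much a refusal update on the toxic prompt carries over.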
1. **Rejection probability calculation can be biased.** As described in Appendix A.5, the refusal probability aggregates mass over a fixed list of refusal strings, but models can refuse with more nuanced paraphrases. Reporting calibration such as precision/recall would further support the robustness of this metric, and a learned refusal classifier might also be useful.
2. **Formatting.** Citations that should be parenthetical are written as “Author, Year” instead of
The paper offers a novel insight into over-refusal in LLMs and a reasonable account of its cause. The approach is well motivated by the observed high correlation between the refusal rates of truly toxic and seemingly toxic prompts. It is compared with several baselines, and the experiments show promising performance.
The evaluation uses relatively small models (up to 8B parameters), so scalability is not shown. I found the observations in Figures 1 and 3 interesting; they are the core of the proposed approach. However, they are provided only for a small (1.5B) model. To support the paper's claims, it is recommended to add the same observations for mid-sized (7–8B) models. The strength of the contrastive learning (such as the number of epochs, learning rate, etc.) may affect th
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
