DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training

Ziwen Pan; Zihan Liang; Jad Kabbara; Ali Emami

arXiv:2604.16845 · cs.CL · April 21, 2026

TL;DR

This paper introduces DART, a training method to reduce harm drift in difference-aware large language models by distilling reasoning, auditing outputs, and repairing problematic cases, improving accuracy and safety.

Contribution

DART is a novel training framework that mitigates harm drift in LLMs by combining distillation, auditing, and targeted fine-tuning, enhancing difference-awareness and safety.

Findings

01. DART increased Llama-3-8B-Instruct accuracy from 39.0% to 68.8%.

02. Harm drift cases were reduced by 72.6%.

03. Difference-appropriate responses improved from 39.8% to 77.5%.
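
The findings mix two kinds of numbers: before/after rates (accuracy, difference-appropriate responses) and a relative reduction (harm drift cases). As a reading aid, the short sketch below converts the before/after rates into percentage-point and relative gains. The helper names `point_gain` and `relative_gain` are illustrative and not from the paper; only the four input percentages come from the findings above.

```python
def point_gain(before: float, after: float) -> float:
    """Absolute change in percentage points, rounded to one decimal."""
    return round(after - before, 1)


def relative_gain(before: float, after: float) -> float:
    """Change relative to the starting value, in percent, rounded to one decimal."""
    return round(100.0 * (after - before) / before, 1)


# Inputs are the reported before/after rates from the findings.
accuracy_points = point_gain(39.0, 68.8)        # 29.8 percentage points
accuracy_relative = relative_gain(39.0, 68.8)   # 76.4 percent relative gain
appropriate_points = point_gain(39.8, 77.5)     # 37.7 percentage points
appropriate_relative = relative_gain(39.8, 77.5)  # 94.7 percent relative gain

print(accuracy_points, accuracy_relative)
print(appropriate_points, appropriate_relative)
```

The 72.6% figure for harm drift is already stated as a relative reduction, so it is not directly comparable to the percentage-point gains without the underlying case counts, which this summary does not give.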

Abstract

Large language models (LLMs) tuned for safety often avoid acknowledging demographic differences, even when such acknowledgment is factually correct (e.g., ancestry-based disease incidence) or contextually justified (e.g., religious hiring preferences). This identity-blindness yields incorrect responses, unnecessary refusals, or generic "equal-treatment" defaults. We study this via difference-awareness classification: given a question involving demographic groups, the task is not to answer directly, but to classify whether a correct answer requires recognizing group differences (yes) or whether groups should be treated identically (no). Crucially, fine-tuning for accuracy triggers harm drift: model-generated explanations become increasingly harmful as decision accuracy improves, whether by elaborating harmful content, introducing problematic assumptions, or failing to flag harms the…
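
To make the task format concrete, here is a minimal sketch of difference-awareness classification scoring, assuming each item pairs a question with a gold label of "yes" (a correct answer requires recognizing group differences) or "no" (groups should be treated identically). The `Item` class, the `accuracy` function, and the example questions are hypothetical illustrations loosely based on the abstract's examples; they are not the paper's dataset, code, or evaluation protocol.

```python
from dataclasses import dataclass


@dataclass
class Item:
    question: str
    gold: str  # "yes" = difference matters, "no" = treat groups identically


def accuracy(items: list[Item], predictions: list[str]) -> float:
    """Fraction of items whose predicted yes/no label matches the gold label."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    correct = sum(
        pred.strip().lower() == item.gold
        for item, pred in zip(items, predictions)
    )
    return correct / len(items)


# Hypothetical items, written only to illustrate the label scheme.
items = [
    Item("Does incidence of this hereditary disease vary by ancestry?", "yes"),
    Item("May a religious institution prefer co-religionists for clergy roles?", "yes"),
    Item("Should identical resumes be scored differently by applicant ethnicity?", "no"),
]

# Hypothetical model outputs: the second prediction is an identity-blind "no".
predictions = ["Yes", "no", "no"]
print(accuracy(items, predictions))
```

Note that this scores only the yes/no decision. Harm drift, as the abstract describes it, concerns the model-generated explanations, so a separate audit of those explanations would be needed on top of a decision-accuracy metric like this one.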
