DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training
Ziwen Pan, Zihan Liang, Jad Kabbara, Ali Emami

TL;DR
This paper introduces DART (Distill-Audit-Repair Training), a training method that reduces harm drift in difference-aware large language models by distilling reasoning, auditing outputs, and repairing problematic cases, improving both accuracy and safety.
Contribution
DART is a novel training framework that mitigates harm drift in LLMs by combining distillation, auditing, and targeted fine-tuning, which improves both difference-awareness and safety.
Findings
DART increased Llama-3-8B-Instruct accuracy from 39.0% to 68.8%.
Harm drift cases were reduced by 72.6%.
Difference-appropriate responses improved from 39.8% to 77.5%.
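To put the two before/after figures above on a common footing, the following is a minimal arithmetic sketch that converts each pair into a percentage-point gain and a relative gain. The helper name `gains` is hypothetical and not from the paper; the only inputs are the percentages listed in the Findings. The 72.6% harm drift reduction is already a relative figure and its underlying counts are not given here, so it is left out.

```python
def gains(before: float, after: float) -> tuple[float, float]:
    """Return (percentage-point gain, relative gain in %) for two percentages.

    Hypothetical helper for illustration; not part of the DART method.
    """
    point_gain = after - before
    relative_gain = 100.0 * point_gain / before
    return round(point_gain, 1), round(relative_gain, 1)


# Figures as reported in the Findings (Llama-3-8B-Instruct).
accuracy = gains(39.0, 68.8)
difference_appropriate = gains(39.8, 77.5)

print("accuracy:", accuracy)                              # (29.8, 76.4)
print("difference-appropriate:", difference_appropriate)  # (37.7, 94.7)
```

Read this way, accuracy rises by 29.8 percentage points (about 76% relative to the baseline), and difference-appropriate responses rise by 37.7 points (about 95% relative).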
Abstract
Large language models (LLMs) tuned for safety often avoid acknowledging demographic differences, even when such acknowledgment is factually correct (e.g., ancestry-based disease incidence) or contextually justified (e.g., religious hiring preferences). This identity-blindness yields incorrect responses, unnecessary refusals, or generic "equal-treatment" defaults. We study this via difference-awareness classification: given a question involving demographic groups, the task is not to answer directly, but to classify whether a correct answer requires recognizing group differences (yes) or whether groups should be treated identically (no). Crucially, fine-tuning for accuracy triggers harm drift: model-generated explanations become increasingly harmful as decision accuracy improves, whether by elaborating harmful content, introducing problematic assumptions, or failing to flag harms the…
