Paying Alignment Tax with Contrastive Learning
Buse Sibel Korkmaz, Rahul Nair, Elizabeth M. Daly, Antonio del Rio Chanona

TL;DR
This paper introduces a contrastive learning framework that effectively reduces bias and toxicity in language models while maintaining factual accuracy and knowledge, overcoming trade-offs faced by existing methods.
Contribution
The paper presents a novel contrastive learning approach with dynamic loss scaling that improves bias mitigation and faithfulness preservation simultaneously.
Findings
Significant reduction in toxicity across multiple benchmarks.
Enhanced faithfulness and knowledge retention in models.
First method to improve both bias mitigation and accuracy concurrently.
Abstract
Current debiasing approaches often result in a degradation of model capabilities such as factual accuracy and knowledge retention. Through systematic evaluation across multiple benchmarks, we demonstrate that existing debiasing methods face fundamental trade-offs, particularly in smaller models, leading to reduced truthfulness, knowledge loss, or unintelligible outputs. To address these limitations, we propose a contrastive learning framework that learns through carefully constructed positive and negative examples. Our approach introduces contrast computation and dynamic loss scaling to balance bias mitigation with faithfulness preservation. Experimental results across multiple model scales demonstrate that our method achieves substantial improvements in both toxicity reduction and faithfulness preservation. Most importantly, we show that our framework is the first to consistently improve…
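The abstract names two ingredients, contrast computation over positive and negative examples and dynamic loss scaling, without giving their formulas here. The sketch below is only an illustration of how such a pairing could work, not the authors' formulation. It assumes each example is summarised by a sequence log-probability under the model; the pairwise logistic contrast term, the weighting rule, and the names contrast_loss, dynamic_weight, and combined_loss are all hypothetical.

```python
import math


def contrast_loss(logp_pos, logp_neg, temperature=1.0):
    """Pairwise logistic loss (assumed form, not taken from the paper).

    Small when the positive example is much more likely than the negative one.
    """
    margin = (logp_pos - logp_neg) / temperature
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def dynamic_weight(logp_pos, logp_neg, temperature=1.0):
    """Hypothetical scaling rule: 1 - sigmoid(margin).

    Near 1 when the model still prefers the negative example,
    near 0 once the pair is well separated.
    """
    return -math.expm1(-contrast_loss(logp_pos, logp_neg, temperature))


def combined_loss(logp_pos, logp_neg, temperature=1.0):
    """Faithfulness term (NLL of the positive) plus the scaled contrast term."""
    lm_loss = -logp_pos
    weight = dynamic_weight(logp_pos, logp_neg, temperature)
    return lm_loss + weight * contrast_loss(logp_pos, logp_neg, temperature)


if __name__ == "__main__":
    # Positive already preferred: the contrast term contributes very little.
    print(combined_loss(-1.0, -6.0))
    # Negative preferred: the contrast term is weighted up.
    print(combined_loss(-1.0, -0.5))
```

Under this assumed rule, the bias-mitigation pressure fades once a pair is separated, so the language-modelling term dominates and further updates are less likely to erode faithfulness. In a real training loop the weight would typically be treated as a constant with respect to gradients.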
Taxonomy
Topics: Financial Literacy, Pension, Retirement Analysis · Fiscal Policy and Economic Growth
Methods: Contrastive Learning
