TL;DR
The paper introduces TRACE, a scalable framework that realigns large language models to evolving alignment policies by optimizing over existing data, without requiring new human annotations.
Contribution
TRACE transforms realignment into an optimization problem over existing data, reducing reliance on re-annotation and handling evolving alignment guidelines effectively.
Findings
TRACE demonstrates robust realignment across multiple LLMs and datasets.
It maintains general utility while improving alignment with changed policies.
It operates without additional human annotation.
Abstract
Post-training alignment of large language models (LLMs) relies on large-scale human annotations guided by policy specifications that change over time. Cultural shifts, value reinterpretations, and regulatory or industrial updates make static alignment increasingly brittle. As policies evolve, deployed models can diverge from current alignment objectives, creating an Alignment-Reality Gap that is difficult to audit or correct. Existing remediation typically requires re-annotation under revised guidelines, which introduces systematic challenges, including guideline ambiguity, annotator interpretation drift, and reduced consistency at scale. We introduce TRACE (Triage and Re-align by Alignment Conflict Evaluation), a framework that transforms realignment into a structured optimization problem over existing data without requiring fresh human annotation. Leveraging a stronger model as a…
