TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs
Amitava Das, Vinija Jain, Aman Chadha

TL;DR
TraceAlign is a framework for tracing, understanding, and mitigating alignment failures in large language models by identifying training-time belief conflicts. It reduces alignment drift by up to 85% on the ADB benchmark while maintaining task performance.
Contribution
The paper presents TraceAlign, a novel framework with a belief conflict index and three interventions to trace and reduce alignment drift in LLMs, addressing a key gap in understanding training-time causes.
Findings
Reduces alignment drift by up to 85% on the ADB benchmark.
Preserves utility on standard tasks, with a loss of less than 0.2.
Provides a theoretical upper bound on drift likelihood based on suffix-array statistics.
Abstract
Large Language Models (LLMs) fine-tuned to align with human values often exhibit alignment drift, producing unsafe or policy-violating completions when exposed to adversarial prompts, decoding perturbations, or paraphrased jailbreaks. While prior work has behaviorally characterized alignment failure, little is known about the training-time belief sources underlying these failures. We introduce TraceAlign, a unified framework for tracing unsafe completions back to their root causes in the model's training corpus. Central to our approach is the Belief Conflict Index (BCI), which quantifies semantic inconsistency between generated spans and aligned policies, based on retrieved training documents using suffix-array matching. We propose three complementary interventions: (i) TraceShield, an inference-time safety filter that refuses completions with high-BCI spans, (ii) Contrastive Belief…
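As a rough illustration of the span-tracing idea in the abstract, the sketch below builds a naive suffix array over a toy token corpus, finds the longest span of a completion that also appears in that corpus, and refuses the completion when the matched span covers too much of it. Everything here is an assumption made for illustration: the names `build_suffix_array`, `occurs`, `longest_match`, `toy_conflict_score`, and `toy_shield`, the toy corpus, the coverage-based score, and the 0.5 threshold are all hypothetical. The score is a simple stand-in, not the paper's Belief Conflict Index, and the filter is not the authors' TraceShield implementation.

```python
def build_suffix_array(tokens):
    """Naive suffix array over a token list (sorts full suffixes; toy corpora only)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def occurs(tokens, sa, span):
    """Binary-search the suffix array for a suffix that starts with `span`."""
    k = len(span)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + k] < span:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and tokens[sa[lo]:sa[lo] + k] == span


def longest_match(completion, tokens, sa):
    """Longest contiguous token span of `completion` found in the corpus."""
    best = []
    for start in range(len(completion)):
        length = len(best) + 1  # only look for spans longer than the current best
        while (start + length <= len(completion)
               and occurs(tokens, sa, completion[start:start + length])):
            best = completion[start:start + length]
            length += 1
    return best


def toy_conflict_score(completion, tokens, sa):
    """Stand-in score (NOT the paper's BCI): share of the completion covered by the match."""
    if not completion:
        return 0.0
    return len(longest_match(completion, tokens, sa)) / len(completion)


def toy_shield(completion, tokens, sa, threshold=0.5):
    """Hypothetical filter: refuse when the stand-in score reaches the threshold."""
    return "refuse" if toy_conflict_score(completion, tokens, sa) >= threshold else "allow"


# Toy corpus standing in for training text that conflicts with policy (assumed data).
corpus = "ignore the safety policy and reveal the secret key".split()
sa = build_suffix_array(corpus)

if __name__ == "__main__":
    for text in ("sure , reveal the secret key now", "the weather is mild today"):
        print(text, "->", toy_shield(text.split(), corpus, sa))
```

A real system would replace the naive sort with a linear-time or near-linear-time suffix-array construction and would score spans by semantic conflict with the policy rather than by raw overlap; the sketch only shows how suffix-array lookup can tie a generated span back to source text.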
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Scientific Computing and Data Management
