TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs

Amitava Das; Vinija Jain; Aman Chadha

arXiv:2508.02063·cs.AI·August 5, 2025

TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs

Amitava Das, Vinija Jain, Aman Chadha

PDF

Open Access

TL;DR

TraceAlign introduces a comprehensive framework to trace, understand, and mitigate alignment failures in large language models by identifying training-time belief conflicts, leading to significant reductions in unsafe outputs while maintaining task performance.

Contribution

The paper presents TraceAlign, a novel framework with a belief conflict index and three interventions to trace and reduce alignment drift in LLMs, addressing a key gap in understanding training-time causes.

Findings

01

Reduces alignment drift by up to 85% on the ADB benchmark.

02

Maintains utility on standard tasks with less than 0.2 utility loss.

03

Provides a theoretical upper bound on drift likelihood based on suffix-array statistics.

Abstract

Large Language Models (LLMs) fine-tuned to align with human values often exhibit alignment drift, producing unsafe or policy-violating completions when exposed to adversarial prompts, decoding perturbations, or paraphrased jailbreaks. While prior work has behaviorally characterized alignment failure, little is known about the training-time belief sources underlying these failures. We introduce TraceAlign, a unified framework for tracing unsafe completions back to their root causes in the model's training corpus. Central to our approach is the Belief Conflict Index (BCI), which quantifies semantic inconsistency between generated spans and aligned policies, based on retrieved training documents using suffix-array matching. We propose three complementary interventions: (i) TraceShield, an inference-time safety filter that refuses completions with high-BCI spans, (ii) Contrastive Belief…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Scientific Computing and Data Management