Rethinking Deep Alignment Through The Lens Of Incomplete Learning
Thong Bach, Dung Nguyen, Thao Minh Le, Truyen Tran

TL;DR
This paper analyzes why large language models remain vulnerable to adversarial attacks despite safety alignment, identifies position-dependent gradient weakening during autoregressive training as a cause, and proposes a targeted method that improves adversarial robustness while preserving general capabilities.
Contribution
It introduces base-favored tokens as indicators of incomplete safety learning and develops a targeted completion method to improve robustness.
Findings
48-98% reduction in attack success rates
Improved adversarial robustness across Llama and Qwen models
Preservation of general capabilities
Abstract
Large language models exhibit systematic vulnerabilities to adversarial attacks despite extensive safety alignment. We provide a mechanistic analysis revealing that position-dependent gradient weakening during autoregressive training creates signal decay, leading to incomplete safety learning, where safety training fails to fully transform model preferences in later response regions. We introduce base-favored tokens (vocabulary elements to which base models assign higher probability than aligned models) as computational indicators of incomplete safety learning, and develop a targeted completion method that addresses undertrained regions through adaptive penalties and hybrid teacher distillation. Experimental evaluation across Llama and Qwen model families demonstrates dramatic improvements in adversarial robustness, with 48-98% reductions in attack success rates while preserving general…
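The definition of base-favored tokens in the abstract can be illustrated with a minimal sketch. It assumes toy next-token logits for a base and an aligned model at a single response position; the function names, the `margin` parameter, and the four-word vocabulary are hypothetical and chosen only for illustration. The paper's adaptive penalties and hybrid teacher distillation are not shown.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def base_favored_tokens(base_logits, aligned_logits, margin=0.0):
    """Return indices of tokens to which the base model assigns higher
    probability than the aligned model.

    `margin` is an illustrative assumption, not a parameter from the paper:
    a token counts as base-favored only if the probability gap exceeds it.
    """
    base_p = softmax(base_logits)
    aligned_p = softmax(aligned_logits)
    return [i for i, (b, a) in enumerate(zip(base_p, aligned_p)) if b - a > margin]

# Toy vocabulary and logits at one position (made-up numbers).
vocab = ["Sure", "I", "cannot", "Here"]
base_logits = [2.0, 0.5, 0.1, 1.5]      # base model leans toward complying
aligned_logits = [0.2, 1.0, 2.5, 0.1]   # aligned model leans toward refusing

favored = [vocab[i] for i in base_favored_tokens(base_logits, aligned_logits)]
print(favored)
```

In this toy setup the tokens flagged are the ones the aligned model has moved probability away from. Under the abstract's framing, positions where the aligned model has not moved away from such tokens would be candidates for incomplete safety learning.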
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Domain Adaptation and Few-Shot Learning
