Rethinking Deep Alignment Through The Lens Of Incomplete Learning

Thong Bach; Dung Nguyen; Thao Minh Le; Truyen Tran

arXiv:2511.12155·cs.LG·November 18, 2025

Open Access

TL;DR

This paper analyzes why safety-aligned large language models remain vulnerable to adversarial attacks, identifies position-dependent gradient weakening during autoregressive training as a cause, and proposes a targeted completion method that improves adversarial robustness while preserving general capabilities.

Contribution

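It introduces base-favored tokens, vocabulary elements to which base models assign higher probability than aligned models do, as indicators of incomplete safety learning, and develops a targeted completion method that uses adaptive penalties and hybrid teacher distillation to improve robustness.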

Findings

01 48-98% reduction in attack success rates

02 Improved adversarial robustness across Llama and Qwen models

03 Preservation of general capabilities

Abstract

Large language models exhibit systematic vulnerabilities to adversarial attacks despite extensive safety alignment. We provide a mechanistic analysis revealing that position-dependent gradient weakening during autoregressive training creates signal decay, leading to incomplete safety learning, in which safety training fails to fully transform model preferences in later response regions. We introduce base-favored tokens (vocabulary elements to which base models assign higher probability than aligned models do) as computational indicators of incomplete safety learning, and we develop a targeted completion method that addresses undertrained regions through adaptive penalties and hybrid teacher distillation. Experimental evaluation across the Llama and Qwen model families demonstrates dramatic improvements in adversarial robustness, with 48-98% reductions in attack success rates while preserving general…
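The definition of base-favored tokens in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `margin` parameter, the toy vocabulary, and the logit values are all assumptions made for illustration. It shows only the comparison the abstract describes, at a single response position, between the next-token distribution of a base model and that of an aligned model.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def base_favored_tokens(base_logits, aligned_logits, margin=0.0):
    """Return indices of tokens the base model favors over the aligned model.

    A token counts as base-favored when its base-model probability exceeds
    its aligned-model probability by more than `margin`. The `margin`
    argument is a hypothetical knob added here, not something the abstract
    specifies; margin=0.0 reproduces the plain "higher probability" rule.
    """
    p_base = softmax(base_logits)
    p_aligned = softmax(aligned_logits)
    return [
        i
        for i, (pb, pa) in enumerate(zip(p_base, p_aligned))
        if pb - pa > margin
    ]


# Toy four-token vocabulary and made-up logits for one response position.
vocab = ["Sure", "I", "cannot", "Here"]
base_logits = [2.0, 0.5, 0.1, 1.5]
aligned_logits = [0.2, 1.5, 2.5, 0.1]

favored = base_favored_tokens(base_logits, aligned_logits)
print([vocab[i] for i in favored])
```

In this toy example the aligned model has shifted probability toward refusal-style tokens, so the compliance-style tokens remain base-favored. With real models, the two logit vectors would come from running the base and aligned checkpoints on the same prefix, and the comparison would be repeated at each response position.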

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Domain Adaptation and Few-Shot Learning