Correcting Suppressed Log-Probabilities in Language Models with Post-Transformer Adapters
Bryan Sanchez

TL;DR
This paper introduces a post-transformer adapter that corrects suppressed factual log-probabilities in alignment-tuned language models on politically sensitive topics, adding only about 786K parameters (approximately 0.02% of the base model).
Contribution
It demonstrates that a small adapter trained on frozen hidden states can correct factual suppression across three model scales (Qwen3-4B, 8B, and 14B) and generalize to facts it was not trained on.
Findings
The adapter memorizes all 15 training facts and generalizes to 11–39% of the 16 held-out facts, depending on the split and model scale.
Applying the adapter only at the last token position yields coherent, less censored text.
A silent gradient bug in Apple MLX caused null results in earlier experiments.
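The last-token finding above can be illustrated with a minimal NumPy sketch. It assumes a residual linear-bottleneck form (hidden state plus a low-rank correction) and a zero-initialized up-projection; the summary does not confirm either detail, and the names `init_adapter` and `apply_adapter` and the toy dimensions are hypothetical rather than the author's code.

```python
import numpy as np

def init_adapter(d_model, rank, seed=0):
    """Ungated linear-bottleneck adapter (assumed form, not the paper's exact design)."""
    rng = np.random.default_rng(seed)
    return {
        "down": rng.normal(0.0, 0.02, size=(d_model, rank)),
        # Assumption: zero-initialized up-projection, so the adapter starts as the identity.
        "up": np.zeros((rank, d_model)),
    }

def apply_adapter(hidden, params, last_only=True):
    """Add a low-rank correction to final-layer hidden states.

    hidden: array of shape (seq_len, d_model) taken from the frozen base model.
    last_only: if True, only the last token position is modified.
    """
    delta = hidden @ params["down"] @ params["up"]
    if last_only:
        mask = np.zeros((hidden.shape[0], 1))
        mask[-1] = 1.0  # keep the correction only at the final position
        delta = delta * mask
    return hidden + delta

# Toy dimensions, chosen for illustration only.
d_model, rank, seq_len = 16, 4, 5
hidden = np.random.default_rng(2).normal(size=(seq_len, d_model))

untrained = init_adapter(d_model, rank)
identity_out = apply_adapter(hidden, untrained)

# Stand-in for a trained adapter: give the up-projection nonzero weights.
trained = init_adapter(d_model, rank)
trained["up"] = np.random.default_rng(1).normal(0.0, 0.02, size=(rank, d_model))
out = apply_adapter(hidden, trained, last_only=True)

n_params = sum(p.size for p in trained.values())
```

In this sketch the base model's weights are never touched: only `down` and `up` would be trained, and the corrected hidden state would then be passed to the frozen output head to obtain log-probabilities.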
Abstract
Alignment-tuned language models frequently suppress factual log-probabilities on politically sensitive topics despite retaining the knowledge in their hidden representations. We show that a 786K-parameter (approximately 0.02% of the base model) post-transformer adapter, trained on frozen hidden states, corrects this suppression on 31 ideology-discriminating facts across Qwen3-4B, 8B, and 14B. The adapter memorizes all 15 training facts and generalizes to 11–39% of 16 held-out facts across 5 random splits per scale, with zero knowledge regressions via anchored training. Both gated (SwiGLU) and ungated (linear bottleneck) adapters achieve comparable results; neither consistently outperforms the other (Fisher exact p > 0.09 at all scales). On instruct models, the adapter corrects log-probability rankings. When applied at all token positions during generation, the adapter produces…
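The gated-versus-ungated comparison in the abstract relies on a Fisher exact test. As a rough illustration of how such a comparison can be computed, the sketch below implements a two-sided Fisher exact test for a 2x2 table using only the standard library. The counts in the example are invented for illustration and are not the paper's results; the function name `fisher_exact_two_sided` is likewise hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Probability that the top-left cell equals x, given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical counts (NOT from the paper): held-out facts corrected vs. not
# corrected, pooled over 5 splits of 16 held-out facts, for each adapter type.
gated_corrected, gated_missed = 20, 60
ungated_corrected, ungated_missed = 25, 55

p_value = fisher_exact_two_sided(
    gated_corrected, gated_missed, ungated_corrected, ungated_missed
)
print(f"two-sided Fisher exact p = {p_value:.3f}")
```

A large p-value in a test like this means the data give no evidence that one adapter type corrects more held-out facts than the other, which is how the abstract's "neither consistently outperforms the other" should be read.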
