Check Your LLM's Secret Dictionary! Five Lines of Code Reveal What Your LLM Learned (Including What It Shouldn't Have)
Hisashi Miyashita

TL;DR
This paper demonstrates that a simple SVD analysis of LLM weights reveals interpretable semantic subspaces, exposing training data biases and ethical issues and suggesting uses in safety auditing and model control.
Contribution
The paper introduces a lightweight SVD-based method for analyzing LLM weights that reveals semantic subspaces, biases, and safety concerns without running model inference, and it proposes new evaluation metrics.
Findings
Different models show distinct vocabulary cluster structures.
Ethically concerning subspaces originate in pretraining, not post-training.
WPS detects known glitch tokens without model inference.
Abstract
We show that singular value decomposition of the lm_head weight matrix of a transformer-based large language model, requiring only five lines of PyTorch and no model inference, reveals interpretable semantic subspaces directly from the model weights. Each left singular vector identifies the vocabulary tokens most readily selected when the hidden state aligns with the corresponding singular direction; inspecting these clusters exposes the model's training data composition and curation philosophy. Analysing GPT-OSS-120B, Gemma-2-2B, and Qwen2.5-1.5B, we find that singular value spectra and vocabulary cluster structures differ systematically across models: GPT exhibits a graduated hierarchy of functionally differentiated subspaces; Gemma is dominated by pre-nineteenth-century English orthography, forming a stepwise clustering structure that may contribute to high output…
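The abstract describes the analysis only in words, so the following is a minimal sketch of what such a decomposition might look like, not the author's actual five lines. It assumes the output-projection weight has shape (vocab_size, hidden_dim), so that logits are the weight times the hidden state, and it uses a tiny hand-built matrix in place of a real model. The function name `singular_token_clusters`, the toy vocabulary, and the toy weights are all hypothetical.

```python
import torch


def singular_token_clusters(weight, vocab, n_dirs=2, k=3):
    """List the tokens with the largest loadings on each leading singular direction.

    weight: tensor of shape (vocab_size, hidden_dim), assumed to map a hidden
            state h to logits via weight @ h.
    vocab:  list of token strings, one per row of `weight`.
    """
    W = weight.detach().float()
    # Thin SVD: W = U @ diag(S) @ Vh. Column i of U has one entry per token.
    U, S, _ = torch.linalg.svd(W, full_matrices=False)
    clusters = []
    for i in range(n_dirs):
        u = U[:, i]
        # The sign of a singular vector is arbitrary, so report both poles.
        pos = [vocab[j] for j in torch.topk(u, k).indices.tolist()]
        neg = [vocab[j] for j in torch.topk(-u, k).indices.tolist()]
        clusters.append({"sigma": S[i].item(), "pos": pos, "neg": neg})
    return clusters


# Hypothetical toy data: two groups of tokens, each sharing one hidden direction.
toy_vocab = ["cat", "dog", "fox", "one", "two", "six"]
toy_weight = torch.tensor([
    [3.0, 0.0, 0.0, 0.0],
    [3.0, 0.0, 0.0, 0.0],
    [3.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
])

clusters = singular_token_clusters(toy_weight, toy_vocab)
for i, c in enumerate(clusters):
    print(i, round(c["sigma"], 3), c["pos"], c["neg"])
```

For a real model, `weight` would presumably be the output-embedding matrix (in Hugging Face transformers this is often reachable as `model.lm_head.weight`, though that attribute name is an assumption that varies by architecture) and `vocab` would come from the tokenizer. A full SVD of a large vocabulary matrix can be memory-intensive, so a truncated decomposition may be preferable in practice.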
