Understanding Token Probability Encoding in Output Embeddings

Hakaze Cho; Yoshihiro Sakai; Kenshiro Tanaka; Mariko Kato; Naoya Inoue

arXiv:2406.01468 · cs.CL · December 12, 2024 · 1 cites

TL;DR

This paper uncovers an approximate log-linear encoding of token probabilities in language model output embeddings, demonstrating its accuracy, sparsity, and implications for model efficiency and understanding pre-training dynamics.

Contribution

It reveals the sparse, log-linear structure of output token probabilities in embeddings and shows that many embedding dimensions are redundant, enabling potential model compression.

Findings

01. Output token probabilities are encoded in the output embeddings in a sparse, approximately log-linear manner.

02. More than 30% of output embedding dimensions can be deleted without significant change in the output distribution or in sequence generation.

03. Output embeddings capture token frequency information early in pre-training.
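One way to read the first finding, offered as a minimal sketch rather than the authors' procedure: if log-probabilities are approximately linear in the output embedding, an ordinary least-squares fit of log-probability on the embedding vectors should explain most of the variance, and a sparse encoding should leave most fitted weights near zero. The example uses synthetic data only; `fit_log_linear`, the array shapes, the noise level, and the choice of 8 active dimensions are all assumptions made for illustration.

```python
import numpy as np


def fit_log_linear(embeddings, log_probs):
    """Least-squares fit of log_probs ~ embeddings @ w + b.

    Returns the weight vector w, the intercept b, and the R^2 of the fit.
    Hypothetical helper for illustration, not taken from the paper.
    """
    n = embeddings.shape[0]
    # Append a column of ones so the intercept is fitted with the weights.
    design = np.hstack([embeddings, np.ones((n, 1))])
    coef, *_ = np.linalg.lstsq(design, log_probs, rcond=None)
    w, b = coef[:-1], coef[-1]
    pred = design @ coef
    ss_res = np.sum((log_probs - pred) ** 2)
    ss_tot = np.sum((log_probs - log_probs.mean()) ** 2)
    return w, b, 1.0 - ss_res / ss_tot


# Synthetic stand-in for an output embedding matrix (assumption, not real data).
rng = np.random.default_rng(0)
vocab, dim = 2000, 64
E = rng.normal(size=(vocab, dim))

# Assumed ground truth: only the first 8 dimensions carry the encoding.
true_w = np.zeros(dim)
true_w[:8] = rng.normal(size=8)

# Synthetic (unnormalized) log-probabilities: linear in E plus small noise.
log_p = E @ true_w - 5.0 + 0.05 * rng.normal(size=vocab)

w, b, r2 = fit_log_linear(E, log_p)
print(f"R^2 of the log-linear fit: {r2:.4f}")
print(f"largest |weight| outside the active dimensions: {np.abs(w[8:]).max():.4f}")
```

On a real model, `E` would be the output embedding matrix and `log_p` would come from measured output probabilities; how those are obtained is left open here.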

Abstract

In this paper, we investigate the output token probability information encoded in the output embeddings of language models. We find an approximate common log-linear encoding of output token probabilities within the output embedding vectors and empirically demonstrate that it is accurate and sparse. As a causality check, we steer this encoding in the output embeddings and show that doing so modifies the output probability distribution accurately. Moreover, the sparsity of the output probability encoding suggests that a large number of dimensions in the output embeddings do not contribute to causal language modeling. We therefore attempt to delete the output-unrelated dimensions and find that more than 30% of the dimensions can be deleted without significant movement in the output distribution or in sequence generation. Additionally, in the pre-training dynamics of language models, we find that the output embeddings capture…
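The dimension-deletion experiment described in the abstract can be illustrated with a small sketch, under stated assumptions: zero out the lowest-scoring fraction of embedding dimensions, then compare the softmax output before and after with a KL divergence. The helper names, the per-dimension score (mean absolute value of each column), and the synthetic matrix whose last 15 dimensions are deliberately near-inert are all hypothetical choices for this example, not the paper's procedure or data.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def kl_divergence(p, q):
    """KL(p || q) for two strictly positive probability vectors."""
    return float(np.sum(p * (np.log(p) - np.log(q))))


def prune_dimensions(embeddings, importance, fraction):
    """Return a copy with the lowest-importance columns set to zero."""
    dim = embeddings.shape[1]
    n_drop = round(fraction * dim)
    drop = np.argsort(importance)[:n_drop]
    pruned = embeddings.copy()
    pruned[:, drop] = 0.0
    return pruned


# Synthetic output embedding: the last 15 of 50 dimensions are scaled down
# so that they barely influence the logits (assumption for illustration).
rng = np.random.default_rng(0)
vocab, dim = 1000, 50
scale = np.ones(dim)
scale[35:] = 1e-3
E = rng.normal(size=(vocab, dim)) * scale

h = rng.normal(size=dim)             # stand-in for a final hidden state
importance = np.abs(E).mean(axis=0)  # stand-in per-dimension score

E_pruned = prune_dimensions(E, importance, fraction=0.3)
p = softmax(E @ h)
q = softmax(E_pruned @ h)
kl = kl_divergence(p, q)
print(f"KL divergence after deleting 30% of dimensions: {kl:.2e}")
```

Because the deleted dimensions were built to be near-inert, the divergence here is small by construction; the example only shows how such a comparison could be set up, not what a real model would yield.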

Taxonomy

Topics: Advanced Malware Detection Techniques · Adversarial Robustness in Machine Learning