Rethinking Weight Tying: Pseudo-Inverse Tying for LM Stable Training and Updates

Jian Gu; Aldeida Aleti; Chunyang Chen; Hongyu Zhang

arXiv:2602.04556 · cs.CL · May 11, 2026

TL;DR

This paper introduces Pseudo-Inverse Tying (PIT), a novel method for stabilizing training and maintaining token interface consistency in language models by synchronizing embedding and unembedding projections.

Contribution

PIT guarantees a pseudo-inverse-consistent token interface during training, improving stability and explainability in language models, and is applicable to models from 256M to 1.3B parameters.

Findings

01. PIT enhances training stability during continued pretraining.

02. PIT enforces near-exact token-interface consistency across different settings.

03. PIT leads to more predictable lightweight adaptation after pretraining.

Abstract

Weight tying is widely used in compact language models to reduce parameters by sharing the token table between the input embedding and the output projection. However, parameter sharing alone does not guarantee a stable token interface: during training, the correspondence between encoding tokens into hidden states and decoding hidden states into logits can drift, worsening optimization sensitivity and weakening explainability probes that rely on a meaningful vocabulary-space decoder. We propose Pseudo-Inverse Tying (PIT), which synchronizes embedding and unembedding as coupled projections of a shared latent token memory, guaranteeing a pseudo-inverse-consistent interface throughout training. PIT maintains an orthonormal shared memory, obtained by polar initialization from a source checkpoint for continued pretraining or by random orthonormal initialization for from-scratch pretraining,…
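The abstract mentions two ingredients that a small numerical sketch can make concrete: polar initialization of an orthonormal token memory, and the sense in which an orthonormal memory gives a pseudo-inverse-consistent interface. The code below is an illustration only, not the authors' implementation, and it covers only what the visible part of the abstract states. The names `polar_factor`, `orthonormality_error`, `source_table`, and `memory` are hypothetical. The table sizes and the random stand-in for a source checkpoint are assumptions. The Newton-Schulz iteration is one common way to approximate a polar factor, chosen here so the example needs only the standard library.

```python
import math
import random


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def polar_factor(e, iters=60):
    """Approximate the orthonormal polar factor of a tall, full-rank matrix.

    Uses the Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X, which converges
    when every singular value of the starting matrix lies in (0, sqrt(3)).
    """
    # Dividing by the Frobenius norm puts all singular values in (0, 1].
    fro = math.sqrt(sum(x * x for row in e for x in row))
    q = [[x / fro for x in row] for row in e]
    for _ in range(iters):
        correction = matmul(q, matmul(transpose(q), q))
        q = [[1.5 * a - 0.5 * b for a, b in zip(qr, cr)]
             for qr, cr in zip(q, correction)]
    return q


def orthonormality_error(q):
    """Largest absolute entry of Q^T Q - I."""
    gram = matmul(transpose(q), q)
    n = len(gram)
    return max(abs(gram[i][j] - (1.0 if i == j else 0.0))
               for i in range(n) for j in range(n))


# Hypothetical stand-in for a source checkpoint's token table (8 tokens, dim 3).
rng = random.Random(0)
source_table = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(8)]
memory = polar_factor(source_table)

# With orthonormal columns, the transpose is the Moore-Penrose pseudo-inverse,
# so decoding a hidden state to logits and mapping back recovers it.
hidden = [[0.3], [-1.2], [0.5]]
logits = matmul(memory, hidden)                 # 8 x 1 column of logits
recovered = matmul(transpose(memory), logits)   # 3 x 1, should match hidden
```

For a memory Q with orthonormal columns, Q^T Q is the identity, so the transpose acts as the pseudo-inverse and `recovered` matches `hidden` up to floating-point error. Applying `polar_factor` to a random Gaussian matrix is also one way to obtain a random orthonormal initialization. How PIT keeps this property during training is described in the part of the abstract that is cut off here, so the sketch does not attempt it.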
