Enhancing Latent Computation in Transformers with Latent Tokens
Yuchang Sun, Yanxi Chen, Yaliang Li, Bolin Ding

TL;DR
This paper introduces latent tokens, a lightweight augmentation for Transformer-based language models that improves performance and out-of-distribution generalization by steering decoding through attention mechanisms.
Contribution
The paper proposes latent tokens, a parameter-efficient method for enhancing Transformer-based LLMs. The tokens integrate with a pre-trained model and can be applied flexibly at inference time.
Findings
Latent tokens noticeably outperform the baselines, including on out-of-distribution tasks.
The method can be integrated with pre-trained Transformers with minimal overhead.
Synthetic tasks verify the hypotheses about latent tokens' mechanisms.
Abstract
Augmenting large language models (LLMs) with auxiliary tokens has emerged as a promising strategy for enhancing model performance. In this work, we introduce a lightweight method termed latent tokens; these are dummy tokens that may be non-interpretable in natural language but steer the autoregressive decoding process of a Transformer-based LLM via the attention mechanism. The proposed latent tokens can be seamlessly integrated with a pre-trained Transformer, trained in a parameter-efficient manner, and applied flexibly at inference time, while adding minimal complexity overhead to the existing infrastructure of standard Transformers. We propose several hypotheses about the underlying mechanisms of latent tokens and design synthetic tasks accordingly to verify them. Numerical results confirm that the proposed method noticeably outperforms the baselines, particularly in the…
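The abstract describes latent tokens as dummy tokens placed in the sequence so that ordinary tokens can attend to them. The following is a minimal sketch of that bookkeeping only, not the authors' implementation. The periodic placement rule, the negative placeholder IDs, and the function names `insert_latent_tokens` and `strip_latent_tokens` are all assumptions made for illustration; the source does not specify where the tokens are inserted or how they are trained.

```python
from typing import List, Tuple


def insert_latent_tokens(
    token_ids: List[int], latent_ids: List[int], every: int
) -> Tuple[List[int], List[bool]]:
    """Insert the block `latent_ids` before every `every`-th original token.

    Hypothetical placement rule, chosen only for illustration.
    Returns the augmented sequence and a mask that is True at latent positions.
    """
    if every <= 0:
        raise ValueError("every must be positive")
    augmented: List[int] = []
    is_latent: List[bool] = []
    for i, tok in enumerate(token_ids):
        if i % every == 0:
            # Latent tokens go in front of the token at position i.
            augmented.extend(latent_ids)
            is_latent.extend([True] * len(latent_ids))
        augmented.append(tok)
        is_latent.append(False)
    return augmented, is_latent


def strip_latent_tokens(seq: List[int], is_latent: List[bool]) -> List[int]:
    """Drop latent positions, recovering the original token sequence."""
    return [t for t, m in zip(seq, is_latent) if not m]


tokens = [11, 12, 13, 14, 15]   # ordinary vocabulary IDs (made up)
latents = [-1, -2]              # placeholder IDs for two latent tokens (assumption)
augmented, mask = insert_latent_tokens(tokens, latents, every=2)
```

In a setup like this, the mask would typically be used to exclude latent positions from the output text and from the language-modeling loss, while only the latent tokens' embeddings are trained and the pre-trained weights stay frozen. That training detail is one plausible reading of "parameter-efficient", not something the abstract states.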
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Topic Modeling · Explainable Artificial Intelligence (XAI)
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Softmax · Position-Wise Feed-Forward Layer
