Zero Token-Driven Deep Thinking in LLMs: Unlocking the Full Potential of Existing Parameters via Cyclic Refinement

Guanghao Li; Wenhao Jiang; Li Shen; Ming Tang; Chun Yuan

arXiv:2502.12214 · cs.CL · February 19, 2025


Open Access

TL;DR

This paper introduces the Zero Token Transformer (ZTT), a novel method that enhances large language models' efficiency by cyclically refining intermediate layers with a zero-token mechanism, enabling dynamic early exits and better performance under limited parameters.

Contribution

The paper proposes a head-tail decoupled parameter cycling method and a zero-token mechanism to improve LLM efficiency and adaptability without increasing parameter counts.

Findings

01. Achieves better performance under tight parameter budgets.

02. Enables dynamic early exits based on attention scores.

03. Reduces computational overhead while maintaining accuracy.
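
One way to picture the attention-based early exit in the second finding is the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the zero token is placed at position 0 of the key list, the exit rule is a simple threshold on the mean attention weight the zero token receives, and the names `zero_token_weight`, `should_exit`, and `threshold` are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_token_weight(query: List[float],
                      keys: List[List[float]],
                      zero_key: List[float]) -> float:
    """Attention weight one query places on the zero token.

    Assumption: the zero token's key is prepended to the regular keys and
    scored with ordinary scaled dot-product attention.
    """
    scale = math.sqrt(len(query))
    all_keys = [zero_key] + keys
    scores = [sum(q * k for q, k in zip(query, key)) / scale
              for key in all_keys]
    return softmax(scores)[0]


def should_exit(queries: List[List[float]],
                keys: List[List[float]],
                zero_key: List[float],
                threshold: float = 0.9) -> Tuple[bool, float]:
    """Hypothetical exit rule: stop cycling once the mean attention on the
    zero token reaches `threshold`."""
    weights = [zero_token_weight(q, keys, zero_key) for q in queries]
    mean_weight = sum(weights) / len(weights)
    return mean_weight >= threshold, mean_weight


# A zero key strongly aligned with the query absorbs most of the attention.
exit_now, mass = should_exit([[1.0, 0.0]], [[0.0, 1.0]], [10.0, 0.0])
```

The threshold value and the choice to average over queries are placeholders; the listing above does not say how the attention scores are aggregated into an exit decision.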

Abstract

Resource limitations often constrain the parameter counts of Large Language Models (LLMs), hindering their performance. While existing methods employ parameter sharing to reuse the same parameter set under fixed budgets, such approaches typically force each layer to assume multiple roles with a predetermined number of iterations, restricting efficiency and adaptability. In this work, we propose the Zero Token Transformer (ZTT), which features a head-tail decoupled parameter cycling method. We disentangle the first (head) and last (tail) layers from parameter cycling and iteratively refine only the intermediate layers. Furthermore, we introduce a Zero-Token Mechanism, an internal architectural component rather than an input token, to guide layer-specific computation. At each cycle, the model retrieves a zero token (with trainable key values) from a Zero-Token Pool, integrating it…
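
The head-tail decoupled cycling described in the abstract can be sketched as an execution schedule: the head and tail layers each run once, and only the intermediate layers repeat. This is a minimal sketch, not the paper's code. It assumes exactly one head layer and one tail layer, and it assumes the Zero-Token Pool is indexed by cycle number; `build_schedule` and `effective_depth` are hypothetical helper names.

```python
from typing import List, Optional, Tuple


def build_schedule(num_layers: int,
                   num_cycles: int) -> List[Tuple[int, Optional[int]]]:
    """Return (layer_index, zero_token_index) pairs in execution order.

    Layer 0 is the head and layer num_layers - 1 is the tail; both run once
    and use no zero token (None). The intermediate layers run once per
    cycle, paired with that cycle's entry in the zero-token pool.
    """
    if num_layers < 3:
        raise ValueError("need a head, a tail, and at least one intermediate layer")
    if num_cycles < 1:
        raise ValueError("num_cycles must be at least 1")
    tail = num_layers - 1
    schedule: List[Tuple[int, Optional[int]]] = [(0, None)]
    for cycle in range(num_cycles):
        for layer in range(1, tail):
            schedule.append((layer, cycle))  # assumed: pool indexed by cycle
    schedule.append((tail, None))
    return schedule


def effective_depth(num_layers: int, num_cycles: int) -> int:
    """Layer applications per forward pass; stored layers stay at num_layers."""
    return 2 + (num_layers - 2) * num_cycles


schedule = build_schedule(num_layers=4, num_cycles=2)
# schedule == [(0, None), (1, 0), (2, 0), (1, 1), (2, 1), (3, None)]
```

With four stored layers and two cycles, the model applies six layers per pass while its parameter count stays that of four layers, which is the sense in which cycling reuses existing parameters.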


Taxonomy

Topics: Machine Learning and Algorithms · Scientific Computing and Data Management

Methods: Byte Pair Encoding · Dense Connections · Residual Connection · Absolute Position Encodings · Linear Layer · Layer Normalization · Label Smoothing · Attention Is All You Need · Multi-Head Attention · Position-Wise Feed-Forward Layer