Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking

Yilong Chen; Junyuan Shang; Zhenyu Zhang; Yanxi Xie; Jiawei Sheng; Tingwen Liu; Shuohuan Wang; Yu Sun; Hua Wu; Haifeng Wang

arXiv:2502.13842·cs.CL·February 25, 2025

Open Access

TL;DR

The Inner Thinking Transformer introduces dynamic depth scaling and adaptive computation to improve reasoning in language models without increasing parameters, achieving high performance with fewer resources.

Contribution

It proposes a novel architecture that dynamically allocates computation for critical tokens, enabling deeper reasoning without parameter growth.

Findings

01

Achieves 96.5% of the performance of a 466M-parameter Transformer using only 162M parameters.

02

Reduces training data requirements by 43.2%.

03

Outperforms existing Transformer variants on multiple benchmarks.

Abstract

Large language models (LLMs) face inherent performance bottlenecks under parameter constraints, particularly in processing critical tokens that demand complex reasoning. Empirical analysis reveals challenging tokens induce abrupt gradient spikes across layers, exposing architectural stress points in standard Transformers. Building on this insight, we propose Inner Thinking Transformer (ITT), which reimagines layer computations as implicit thinking steps. ITT dynamically allocates computation through Adaptive Token Routing, iteratively refines representations via Residual Thinking Connections, and distinguishes reasoning phases using Thinking Step Encoding. ITT enables deeper processing of critical tokens without parameter expansion. Evaluations across 162M-466M parameter models show ITT achieves 96.5% performance of a 466M Transformer using only 162M parameters, reduces training data…
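
To make the three mechanisms named in the abstract concrete, here is a minimal toy sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the "layer" is a stand-in `tanh` map rather than a Transformer block, the router scores tokens by hidden-state norm (the paper's router is presumably learned), and the step encoding is a hypothetical scalar offset. The names `toy_layer`, `inner_thinking`, and `top_frac` are invented for this example.

```python
import math

def toy_layer(x, w):
    # Stand-in for a shared Transformer block (assumption): elementwise tanh.
    return [math.tanh(w * v) for v in x]

def inner_thinking(tokens, w=0.5, steps=3, top_frac=0.5):
    """Reuse one layer for several 'thinking steps' without adding parameters.

    tokens: list of equal-length float vectors.
    Returns (refined hidden states, number of steps each token received).
    """
    hidden = [list(t) for t in tokens]
    counts = [0] * len(tokens)
    for step in range(steps):
        # Thinking step encoding (hypothetical form): a per-step scalar offset
        # that lets the shared layer tell the steps apart.
        step_code = 0.1 * (step + 1)
        if step == 0:
            # Every token takes the first step.
            selected = list(range(len(hidden)))
        else:
            # Adaptive token routing (assumed score: L2 norm of the hidden
            # state). Only the top fraction of tokens gets extra computation.
            scores = [math.sqrt(sum(v * v for v in h)) for h in hidden]
            k = max(1, int(len(hidden) * top_frac))
            order = sorted(range(len(hidden)), key=lambda i: scores[i], reverse=True)
            selected = order[:k]
        for i in selected:
            out = toy_layer([v + step_code for v in hidden[i]], w)
            # Residual thinking connection: add the step output to the running
            # representation instead of overwriting it.
            hidden[i] = [h + o for h, o in zip(hidden[i], out)]
            counts[i] += 1
    return hidden, counts

hidden, counts = inner_thinking([[0.1, 0.1], [2.0, 2.0], [0.5, 0.5], [3.0, 3.0]])
print(counts)  # [1, 3, 1, 3]
```

In this toy run the two tokens with the largest hidden states receive all three steps while the others receive one, which mirrors the idea of spending more depth on selected tokens while the layer's parameters (`w` here) stay fixed.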

Taxonomy

Topics: Cognitive Science and Mapping · Design Education and Practice

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax