TL;DR
This paper introduces virtual logical depth (VLD), a new dimension for scaling language models by reusing weights to increase effective reasoning depth without adding parameters, leading to improved reasoning capabilities.
Contribution
It explores virtual logical depth as a novel scaling dimension, demonstrating its ability to enhance reasoning without increasing parameter count and providing insights into future scaling strategies.
Findings
VLD maintains knowledge capacity at fixed parameters.
Proper VLD implementation improves reasoning without larger models.
Reasoning gains from VLD are consistent across architectures.
Abstract
Scaling large language models typically involves three dimensions: depth, width, and parameter count. In this work, we explore a fourth dimension, \textbf{virtual logical depth} (VLD), which increases effective algorithmic depth without changing parameter count by reusing weights. While parameter reuse is not new, its role in scaling has been underexplored. Unlike recent test-time methods that scale token-wise, VLD alters the internal computation graph during training and inference. Through controlled experiments, we obtain three key insights. (1) \textit{Knowledge capacity vs. parameters}: at fixed parameter count, VLD leaves knowledge capacity nearly unchanged, while across models capacity still scales with parameters. (2) \textit{Reasoning vs. reuse}: properly implemented VLD substantially improves reasoning ability \emph{without} more parameters, decoupling reasoning from size. This…
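The weight-reuse idea in the abstract can be illustrated with a toy stack in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: each "layer" is a small weight matrix with a residual ReLU update standing in for a transformer block, and the names `make_layer`, `apply_layer`, `forward`, and `reuse_factor` are hypothetical. It only shows that running the same weights several times raises effective depth while the parameter count stays fixed.

```python
import random

def make_layer(width, rng):
    """Hypothetical toy layer: a width x width weight matrix."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(width)] for _ in range(width)]

def apply_layer(weights, x):
    """Residual update x + relu(W x); a stand-in for a full transformer block."""
    out = []
    for row, xi in zip(weights, x):
        pre = sum(w * v for w, v in zip(row, x))
        out.append(xi + max(0.0, pre))
    return out

def forward(layers, x, reuse_factor):
    """Run the whole stack `reuse_factor` times, reusing the same weights each pass."""
    for _ in range(reuse_factor):
        for weights in layers:
            x = apply_layer(weights, x)
    return x

def count_parameters(layers):
    """Count unique weights; reuse does not change this number."""
    return sum(len(row) for weights in layers for row in weights)

def effective_depth(layers, reuse_factor):
    """Number of layer applications in one forward pass."""
    return len(layers) * reuse_factor

rng = random.Random(0)
layers = [make_layer(4, rng) for _ in range(3)]
x = [1.0, -0.5, 0.25, 0.0]

for k in (1, 2, 4):
    y = forward(layers, x, k)
    print(f"reuse={k} params={count_parameters(layers)} depth={effective_depth(layers, k)}")
```

Whether the extra passes help reasoning is an empirical question the paper addresses; the sketch only makes the bookkeeping concrete.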
Peer Reviews
Decision·Submitted to ICLR 2026
- Simple, architecture-compatible idea that is easy to implement.
- Clear synthetic improvements in reasoning at fixed parameter count.
- Some external evidence from LLaMA-3B showing consistent gains.
- Limited generality due to GPT-2-centric evidence. The main claims, including “reasoning increases while knowledge capacity stays constant”, are derived almost entirely from the results based on GPT-2-small-scale models, making it unclear whether the observations extend to modern architectures such as GPT-5, Gemini, or DeepSeek-R1.
- LLaMA-3B SFT results cannot validate the core theory. SFT naturally improves downstream performance, so gains after fine-tuning do not strictly support the claime
The paper makes several valuable contributions through rigorous empirical investigation. The systematic study of VLD as a scaling dimension fills an important gap, as parameter reuse dynamics have remained underexplored despite existing work on layer sharing. The experimental design is comprehensive and well-controlled, spanning both synthetic environments (random sequences for knowledge capacity, iGSM for controlled reasoning evaluation) and real-world benchmarks across multiple domains (mathem
Despite its empirical contributions, the paper has several significant limitations. Most critically, it lacks theoretical or mechanistic explanation for why VLD improves reasoning while maintaining constant knowledge capacity—the work is primarily observational without providing insights into the underlying computational principles. The definitions of "reasoning capability" versus "knowledge capacity" could be more rigorous; measuring reasoning through benchmark accuracy may not capture the full
From an empirical perspective, the work isolates a simple, reproducible manipulation—weight sharing across depth—and documents consistent improvements on controlled synthetic tasks, with some transfer to real benchmarks. The write-up is generally clear and the measurement protocol is spelled out in enough detail to re-implement, including the entropy-based capacity metric and the iGSM setup. The authors are transparent about occasional plateaus and regressions at higher VLD factors, which is app
Conceptual novelty is limited relative to prior work on depth-recurrence and cross-layer sharing. Universal Transformers introduced depth-time recurrence with tied parameters years ago, ALBERT formalized cross-layer sharing, and Takase & Kiyono studied precisely the same three tying patterns (sequence, cycle, reverse-cycle). The present paper mainly scales up the experiments without head-to-head, compute-matched comparisons against those baselines, making the “new scaling dimension” feel like a
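The three tying patterns named in this review (sequence, cycle, reverse-cycle) can be written as maps from layer position to the index of the shared block. The following is a minimal sketch, assuming the common reading of those patterns and a total depth that is an exact multiple of the number of unique blocks; the function names are illustrative and the exact definitions used in the paper under review may differ.

```python
def _check(n_unique, n_total):
    """Assumption of this sketch: total depth is a positive multiple of unique blocks."""
    if n_unique <= 0 or n_total <= 0 or n_total % n_unique != 0:
        raise ValueError("n_total must be a positive multiple of n_unique")

def sequence_map(n_unique, n_total):
    """Consecutive layers share a block: 0,0,1,1,2,2 for 3 blocks and 6 layers."""
    _check(n_unique, n_total)
    per_block = n_total // n_unique
    return [i // per_block for i in range(n_total)]

def cycle_map(n_unique, n_total):
    """The stack of blocks is repeated in order: 0,1,2,0,1,2."""
    _check(n_unique, n_total)
    return [i % n_unique for i in range(n_total)]

def reverse_cycle_map(n_unique, n_total):
    """Like cycle, but the final pass runs the blocks in reverse: 0,1,2,2,1,0."""
    _check(n_unique, n_total)
    head = [i % n_unique for i in range(n_total - n_unique)]
    tail = list(range(n_unique - 1, -1, -1))
    return head + tail

for name, fn in [("sequence", sequence_map),
                 ("cycle", cycle_map),
                 ("reverse-cycle", reverse_cycle_map)]:
    print(name, fn(3, 6))
```

Every map uses each block the same number of times, so the patterns differ only in the order in which shared weights are applied, not in parameter count or total compute.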
