It's Not That Simple. An Analysis of Simple Test-Time Scaling
Guojun Wu

TL;DR
This paper analyzes simple test-time scaling in language models. It finds that the observed scaling behavior is driven mainly by scaling down (enforcing a maximum length), while scaling up by appending 'Wait' gives inconsistent results. The analysis emphasizes the importance of models learning to scale up test-time compute naturally in order to improve performance.
Contribution
The paper clarifies the mechanisms behind test-time scaling, distinguishing between scaling down and scaling up, and highlights the benefits of learning to scale up during reinforcement learning.
Findings
Scaling down is the primary driver of observed scaling behavior.
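Fine-tuning on long CoT data distilled from o1-like models does not significantly affect scaling behavior.
Scaling up by appending "Wait" is inconsistent, as the model may oscillate between solutions.
Learning to scale up test-time compute naturally during reinforcement learning can lead to performance improvements.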
Abstract
Prior work proposed simple test-time scaling, a method for replicating the test-time scaling behavior of o1-like models with models distilled from them by manually controlling test-time compute: either scaling down by enforcing a maximum length or scaling up by iteratively appending "Wait" when the model is about to terminate its generation. This paper presents an analysis of simple test-time scaling and finds that the scaling behavior is largely attributed to scaling down by enforcing a maximum length. In contrast, fine-tuning on long CoT data distilled from o1-like models has no significant impact on scaling behavior, and scaling up by appending "Wait" leads to inconsistencies, as the model may oscillate between solutions. A key distinction exists between scaling down by enforcing a maximum length and scaling up test-time compute in o1-like models, such as DeepSeek-R1. These models are typically…
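The two manual controls described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `scale_down` and `scale_up`, the token-list representation, and the stand-in model `toy_step` are all hypothetical, chosen only to show where the length cap and the appended "Wait" enter the generation loop.

```python
from typing import Callable, List

def scale_down(tokens: List[str], max_len: int) -> List[str]:
    """Scale down: enforce a maximum length by cutting the trace off."""
    return tokens[:max_len]

def scale_up(step: Callable[[List[str]], List[str]],
             prompt: List[str],
             num_waits: int) -> List[str]:
    """Scale up: each time the model is about to stop, append "Wait"
    and let it continue, for num_waits extra rounds."""
    trace = list(prompt)
    for i in range(num_waits + 1):
        trace += step(trace)        # model generates until it would terminate
        if i < num_waits:
            trace.append("Wait")    # suppress termination, force continuation
    return trace

# Hypothetical stand-in for a language model: it always emits the same
# two tokens and then stops. A real model would condition on `trace`.
def toy_step(trace: List[str]) -> List[str]:
    return ["think", "answer"]

extended = scale_up(toy_step, ["Q"], num_waits=2)
capped = scale_down(extended, max_len=4)
```

In this toy setting the extended trace simply repeats the same reasoning after each "Wait". With a real model, each continuation may instead revise the previous answer, which is where the oscillation between solutions noted in the abstract can arise.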
