CyclicReflex: Improving Reasoning Models via Cyclical Reflection Token Scheduling
Chongyu Fan, Yihua Zhang, Jinghan Jia, Alfred Hero, Sijia Liu

TL;DR
CyclicReflex introduces a novel, training-free decoding strategy that adaptively schedules reflection tokens in reasoning models, improving their performance by balancing reflection frequency without additional training and with minimal computational overhead.
Contribution
The paper proposes CyclicReflex, a dynamic, position-dependent reflection token scheduling method that enhances reasoning model performance by regulating reflection token usage without additional training.
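The reviews below describe the method as an analogy to cyclical learning rates, with two hyperparameters (amplitude A and period C). The following is a minimal sketch of what position-dependent scheduling could look like, under loudly stated assumptions: the schedule is taken to be a triangular wave, it is applied as an additive bias on the logits of reflection tokens, and the names `triangular_bias`, `apply_reflection_bias`, and `greedy_token` are hypothetical. It is an illustration of the idea, not the authors' implementation.

```python
from typing import Iterable, List


def triangular_bias(step: int, amplitude: float, period: int) -> float:
    """Hypothetical cyclical schedule: a triangular wave with the given amplitude and period.

    Assumption: the bias starts each cycle at -amplitude, rises linearly to
    +amplitude at mid-cycle, and falls back, like a triangular cyclical learning rate.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    phase = (step % period) / period  # position inside the current cycle, in [0, 1)
    return amplitude * (1.0 - 4.0 * abs(phase - 0.5))


def apply_reflection_bias(logits: List[float], reflection_ids: Iterable[int],
                          step: int, amplitude: float, period: int) -> List[float]:
    """Return a copy of `logits` with the scheduled bias added to reflection-token entries."""
    bias = triangular_bias(step, amplitude, period)
    adjusted = list(logits)
    for idx in reflection_ids:
        adjusted[idx] += bias  # positive bias encourages reflection, negative suppresses it
    return adjusted


def greedy_token(logits: List[float]) -> int:
    """Index of the largest logit (greedy decoding)."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy vocabulary: 0 = "the", 1 = "wait", 2 = "so"; id 1 stands in for a reflection token.
logits = [1.0, 0.8, 0.5]
reflection_ids = {1}
early = apply_reflection_bias(logits, reflection_ids, step=0, amplitude=1.0, period=100)
mid = apply_reflection_bias(logits, reflection_ids, step=50, amplitude=1.0, period=100)
print(greedy_token(early), greedy_token(mid))  # → 0 1
```

At the start of a cycle the reflection token is suppressed and a non-reflective token wins; at mid-cycle the same token is boosted and wins instead. Varying the bias with position, rather than fixing it, is what lets a schedule of this kind alternate between discouraging and encouraging reflection over a long generation.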
Findings
CyclicReflex outperforms standard decoding methods across multiple benchmarks.
Optimal reflection token scheduling improves reasoning accuracy.
The approach is effective across various model sizes from 1.5B to 14B parameters.
Abstract
Large reasoning models (LRMs), such as OpenAI's o1 and DeepSeek-R1, harness test-time scaling to perform multi-step reasoning for complex problem-solving. This reasoning process, executed before producing final answers, is often guided by special juncture tokens that prompt self-evaluative reflection. These transition markers and reflective cues are referred to as "reflection tokens" (e.g., "wait", "but", "alternatively"). In this work, we treat reflection tokens as a "resource" and introduce the problem of resource allocation, aimed at improving the test-time compute performance of LRMs by adaptively regulating the frequency and placement of reflection tokens. Through empirical analysis, we show that both excessive and insufficient use of reflection tokens, referred to as over-reflection and under-reflection, can degrade model performance. To better understand this trade-off, we draw…
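The abstract frames over- and under-reflection as a question of how often reflection tokens appear. As a rough diagnostic, the sketch below counts the reflection cues named in the abstract ("wait", "but", "alternatively") in a reasoning trace and reports their rate per 100 words. The word set, the whitespace-level matching, and the function name `reflection_profile` are illustrative assumptions; a decoding-time method would operate on tokenizer IDs instead.

```python
import re
from collections import Counter

# Reflection cues named in the abstract; this small word set is an illustrative assumption.
# A real reflection vocabulary would be larger and defined over tokenizer IDs.
REFLECTION_WORDS = {"wait", "but", "alternatively"}


def reflection_profile(trace: str, words=REFLECTION_WORDS):
    """Count reflection cues in a reasoning trace and their rate per 100 words."""
    tokens = re.findall(r"[a-z']+", trace.lower())
    counts = Counter(tok for tok in tokens if tok in words)
    total = sum(counts.values())
    rate = 100.0 * total / len(tokens) if tokens else 0.0
    return counts, rate


trace = "Wait, that is wrong. But maybe not. Alternatively, wait again."
counts, rate = reflection_profile(trace)
print(dict(counts), rate)  # → {'wait': 2, 'but': 1, 'alternatively': 1} 40.0
```

Words such as "but" also occur in non-reflective sentences, so a fixed, curated list over- or under-counts reflection; this is the same sensitivity the reviewers raise about the predefined reflection-token set.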
Peer Reviews
Decision: ICLR 2026 Poster
- The paper frames the problem from a novel and interesting perspective: treating reflection tokens as a computational resource and formalizing their management as a scheduling problem.
- The analogy between reflection scheduling and cyclical learning rates is creative and highly intuitive. The resulting method is simple, training-free, and adds minimal computational overhead, which makes it very attractive for practical application.
- The paper provides good analysis, including visualizations…
- The definition of "reflection tokens" is subjective and relies on a predefined, fixed set. The method's dependency on this curated list is a potential vulnerability, as it may not capture all forms of model reflection.
- The core motivation is based on an analogy. While the "landscape of thought" metaphor is appealing, it lacks a rigorous theoretical foundation and strict validation. It is not as formally defined as the "loss landscape" it draws inspiration from, making the connection feel m…
1. The paper is well-written and logically structured, clearly motivating the problem of under- and over-reflection and leading to the proposed solution.
2. The paper addresses the important problem of managing LRM reasoning paths. Framing reflection token generation as a resource allocation problem and drawing an analogy to learning rate scheduling in optimization is clear and novel.
3. The proposed CyclicReflex method demonstrates strong and consistent accuracy improvements across a diverse s…
**Weaknesses:**
1. I have some questions about the experimental setup:
   - The `max_token` limit is set to 8192. It is unclear whether this is long enough for R1-style CoT generation, as significant truncation could prematurely cut off reasoning paths. This could confound the analysis of under-reflection (cut off before completion) and over-reflection (never fully realized), and thus affect the reliability of the results.
   - The hyperparameter search space for the period $C$ seems so…
* The core idea of treating reflection tokens as a schedulable resource and drawing an analogy to learning rate scheduling is highly original and compelling.
* CyclicReflex is simple to implement, incurs no additional training cost, and demonstrates consistent, non-trivial performance gains across a wide range of tasks and model scales.
* The experiments are thorough, covering multiple datasets (MATH, AIME, AMC, GPQA, LiveCodeBench), model families (Qwen, LLaMA), and integration with other tes…
* The method introduces two new hyperparameters (amplitude A, period C). The grid search ranges provided in Appendix A are helpful, but the paper could do more to provide heuristics or guidelines for setting these parameters on new tasks or models, as optimal values seem to vary by dataset (e.g., different C for MATH500 vs. AIME).
* The set of reflection tokens V_hat is a critical component, but its definition and potential sensitivity are not deeply discussed. Is the performance robust to the…
Taxonomy
Topics: Semantic Web and Ontologies
