TL;DR
This paper introduces Shift-FFN, a novel neural network module that amplifies differences between adjacent token representations to improve long chain-of-thought reasoning in large language models, reducing cyclical reasoning and enhancing accuracy.
Contribution
The paper proposes Shift-FFN, a new architecture that enhances long CoT reasoning by dynamically amplifying adjacent token differences, addressing cyclical reasoning issues in fine-tuned models.
Findings
Shift-FFN reduces cyclical reasoning in models.
Combining LoRA with Shift-FFN improves accuracy.
Shift-FFN outperforms standard fine-tuning methods.
Abstract
Recently, models such as OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable performance on complex reasoning tasks through Long Chain-of-Thought (Long-CoT) reasoning. Although distilling this capability into student models significantly enhances their performance, this paper finds that fine-tuning LLMs with full parameters or LoRA with a low rank on long CoT data often leads to Cyclical Reasoning, where models repeatedly reiterate previous inference steps until the maximum length limit. Further analysis reveals that smaller differences in representations between adjacent tokens correlate with a higher tendency toward Cyclical Reasoning. To mitigate this issue, this paper proposes Shift Feedforward Networks (Shift-FFN), a novel approach that edits the current token's representation with the previous one before inputting it to FFN. This architecture dynamically amplifies the…
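The abstract describes two ideas that a small sketch can make concrete: measuring how much adjacent token representations differ, and editing each token's representation with the previous one before the FFN. The NumPy code below is a minimal illustration under stated assumptions, not the paper's implementation. The function names, the low-rank ReLU form of the edit, the weight names (`w_a`, `w_b`, `w_up`, `w_down`), and the relative-change metric are all hypothetical choices made for this example; the paper's exact Shift-FFN formula and its $M(X)$ definition are not given in this summary.

```python
import numpy as np


def adjacent_relative_change(x):
    """Mean relative L2 change between adjacent token representations.

    x has shape (seq_len, d_model). This is an illustrative metric only;
    it is an assumption and may differ from the paper's M(X).
    """
    diffs = np.linalg.norm(x[1:] - x[:-1], axis=-1)
    norms = np.linalg.norm(x[:-1], axis=-1) + 1e-8  # avoid division by zero
    return float(np.mean(diffs / norms))


def shift_then_ffn(x, w_a, w_b, w_up, w_down):
    """Edit each token with its predecessor, then apply a plain ReLU FFN.

    Hypothetical shapes: x (seq_len, d), w_a (2d, r), w_b (r, d),
    w_up (d, h), w_down (h, d). The edit is an assumed low-rank function
    of the concatenation [x_t ; x_{t-1}], added back to x_t.
    """
    # x_{t-1} for every position; the first token has no predecessor, so use zeros.
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])
    pair = np.concatenate([x, prev], axis=-1)      # (seq_len, 2d)
    edit = np.maximum(pair @ w_a, 0.0) @ w_b       # (seq_len, d)
    shifted = x + edit                             # edited FFN input
    out = np.maximum(shifted @ w_up, 0.0) @ w_down
    return shifted, out


# Small demo with random weights (values are arbitrary, for shape checking only).
rng = np.random.default_rng(0)
seq_len, d, r, h = 6, 8, 2, 16
x = rng.normal(size=(seq_len, d))
w_a = rng.normal(size=(2 * d, r)) * 0.1
w_b = rng.normal(size=(r, d)) * 0.1
w_up = rng.normal(size=(d, h)) * 0.1
w_down = rng.normal(size=(h, d)) * 0.1

shifted, out = shift_then_ffn(x, w_a, w_b, w_up, w_down)
print("relative change before edit:", adjacent_relative_change(x))
print("relative change after edit: ", adjacent_relative_change(shifted))
```

With random weights the edit has no reason to enlarge adjacent differences; in the setting the abstract describes, the added parameters would be trained so that it does. The sketch only shows where such an edit sits relative to the FFN and that it depends on the current and previous token alone, which keeps it causal.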
Peer Reviews
Decision·Submitted to ICLR 2026
Rich empirical studies:
- They explored models from Qwen2.5-(3B/7B) to Llama3.1-8B across the datasets AIME24, AMC23, MATH500, and Olympiad.
- Fig 4: meaningful to see how much reasoning is repeated and that the length does not impact much on adding some new directions in their thinking procedures.
- Table 2: good results with the reduced length exceeded percentage.
- M(X) computation is an interesting feature to explore the overall relative change of the embedding trajectories across the layers and the Fig
The datasets are focused too heavily on mathematical reasoning, so we are not sure how well this finding generalizes to the overall reasoning mechanisms of LLMs.
- Table 1: it seems the full models frequently (in almost half the settings) perform better than the LoRA approaches plus their approach.
- It seems the Llama model does not do well with the proposed approach, though Qwen (3B) does work. Qwen (7B) shows mixed results.
- This weakens the implications of the work, "Experimental results demonstrate that
## Originality
The paper identifies a novel problem - Cyclical Reasoning in long CoT fine-tuning - and proposes an original architectural solution (Shift-FFN) that dynamically amplifies representation differences between adjacent tokens. The connection between low adjacent token divergence and cyclical behavior is a creative observation that differs from existing PEFT approaches.
## Quality
The experimental evaluation is comprehensive, covering multiple model sizes (3B, 7B, 8B), diverse mathema
## 1. Insufficient justification for using Length Exceeded as a proxy for Cyclical Reasoning (Lines 105, 319)
The paper claims that responses exceeding the maximum length limit indicate Cyclical Reasoning, but this causal relationship is inadequately established. While the authors mention removing "all repeated text segments" from truncated responses, they provide no quantitative analysis of what proportion of length-exceeded samples actually contain repetitive text.
- In Section 3.1 (line 105
1. The paper presents the intriguing hypothesis that Cyclical Reasoning during long CoT distillation is more likely to occur as the representations of adjacent tokens become more similar. The authors do not merely report the phenomenon to support this claim but also empirically prove it by analyzing the similarity between adjacent token representations (i.e., a decrease in the $M(X)$). This provides a strong motivation by linking the superficial symptoms of the problem to its potential root caus
1. The related work section is slightly misaligned with the paper's core problem (it primarily focuses on PEFT and Long-CoT distillation, not Cyclical Reasoning). The ‘Cyclical Reasoning’ phenomenon is, in fact, a well-established issue previously studied under different names such as ‘Generation Loops’ or ‘Mode Collapse’. The paper's novelty would have been emphasized by directly comparing its architectural-level solution against other existing decoding-based solutions (e.g., [1, 2]).
[1] Li,
