FROST: Filtering Reasoning Outliers with Attention for Efficient Reasoning
Haozheng Luo, Zhuolin Jiang, Md Zahid Hasan, Yan Chen, Soumalya Sarkar

TL;DR
FROST introduces an attention-based method to prune uncritical reasoning paths, reducing token usage and improving accuracy in reasoning models by removing reasoning outliers at the sentence level.
Contribution
FROST is the first to leverage attention weights for pruning reasoning outliers, enhancing efficiency and reliability of reasoning trajectories.
Findings
69.68% average reduction in token usage
26.70% improvement in accuracy over the base model
15.97% reduction in the maximum infinity norm, an attention outlier metric
Abstract
We propose FROST, an attention-aware method for efficient reasoning. Unlike traditional approaches, FROST leverages attention weights to prune uncritical reasoning paths, yielding shorter and more reliable reasoning trajectories. Methodologically, we introduce the concept of reasoning outliers and design an attention-based mechanism to remove them. Theoretically, FROST preserves and enhances the model's reasoning capacity while eliminating outliers at the sentence level. Empirically, we validate FROST on four benchmarks using two strong reasoning models (Phi-4-Reasoning and GPT-OSS-20B), outperforming state-of-the-art methods such as TALE and ThinkLess. Notably, FROST achieves an average 69.68% reduction in token usage and a 26.70% improvement in accuracy over the base model. Furthermore, in evaluations of attention outlier metrics, FROST reduces the maximum infinity norm by 15.97% and…
Peer Reviews
Decision: ICLR 2026 Poster
1. The authors clearly articulate and demonstrate the problem of outliers in reasoning through premise experiments, while also providing a simple yet highly effective solution. This is easy to understand and well-intentioned. 2. The experimental results on multiple datasets are significant and compared with the state-of-the-art algorithms, convincingly demonstrating the effectiveness of FROST. 3. In addition to basic performance experiments, the paper also includes a large number of validation expe
1. In Figure 4, the attention weight of S2 is also significantly compressed, while S3 is actually improved. I understand that S24 is the key reasoning path and has been improved. However, further analysis of S2 and S3 is necessary. 2. The authors demonstrate in Table 3 that removing attention outliers increases the probability assigned to critical sentences. Token entropy is used as an indicator of criticality. While most experiments meet expectations, Entmax15 and Sparsemax exhibit unexpected per
(1) Developing efficient LLMs that adaptively reduce computational overhead based on their inputs is an important topic. (2) The paper draws several observations on the attention characteristics of reasoning models, which facilitate understanding of their underlying decision-making process. (3) The proposed method shows generalizability across different models and datasets.
(1) Prior studies (e.g., Xiao et al., 2024, denoted as [ref1] in the remaining review) have already studied the use of the softmax1 activation in developing efficient LLMs. Adopting the same method in a specific reasoning scenario introduces rather limited technical contributions. In addition, the observations in the paper are also similar to previous ones; for instance, Figure 2 is similar to Figure 7 in [ref1], and [ref1] also pointed out the focus on specific tokens in deeper layers (initial toke
- The paper's primary strength lies in its empirical results. The method achieves a compelling combination of significantly reduced token usage while simultaneously improving accuracy across multiple benchmarks and models. This is a strong and desirable outcome for any efficiency-focused technique. - The approach is methodologically simple and elegant: swapping an activation function and performing a short SFT run. This makes the method practical and easily reproducible. - The supplementary expe
- My main concern is the limited novelty of the core mechanism. The paper frames the use of the Softmax1 function as a key contribution for removing "reasoning outliers." However, this exact function and its properties for suppressing outlier/low-value attention scores were previously discussed in other contexts, notably in Evan Miller's 2021 blog post "Attention is Off by One." While applying this function to reasoning chains is new, the underlying technique for attention modification is not, w
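The reviews above refer to a Softmax1 activation that replaces the standard softmax in attention. As a minimal sketch, assuming the definition from the "Attention Is Off by One" post that the last review cites, softmax1(x)_i = exp(x_i) / (1 + sum_j exp(x_j)), the NumPy code below contrasts the two functions. Whether FROST uses exactly this form is an assumption drawn from the review text, not from the paper itself; the function names and the sample scores are illustrative only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Standard softmax: the outputs always sum to 1 along `axis`."""
    shifted = x - np.max(x, axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def softmax1(x, axis=-1):
    """Assumed form: exp(x_i) / (1 + sum_j exp(x_j)).

    The extra 1 in the denominator behaves like an implicit key with
    score 0, so the outputs may sum to less than 1.
    """
    # Shift by max(max(x), 0) so that neither exp() call can overflow.
    m = np.maximum(np.max(x, axis=axis, keepdims=True), 0.0)
    e = np.exp(x - m)
    return e / (np.exp(-m) + e.sum(axis=axis, keepdims=True))


# Hypothetical attention scores for three keys.
weak = np.array([-4.0, -5.0, -6.0])    # no key is relevant
strong = np.array([10.0, 0.0, 0.0])    # one key clearly dominates

weak_softmax = softmax(weak)      # forced to distribute a total mass of 1
weak_softmax1 = softmax1(weak)    # total mass stays close to 0
strong_softmax = softmax(strong)
strong_softmax1 = softmax1(strong)
```

With uniformly low scores, the standard softmax still has to hand out a total weight of 1, while this softmax1 form lets the row attend to almost nothing. When one score dominates, the two functions give nearly identical weights.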
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)
