TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression

Zhong-Zhi Li; Xiao Liang; Zihao Tang; Lei Ji; Peijie Wang; Haotian Xu; Xing W; Haizhen Huang; Weiwei Deng; Yeyun Gong; Zhijiang Guo; Xiao Liu; Fei Yin; Cheng-Lin Liu

arXiv:2506.02678·cs.CL·June 17, 2025

TL;DR

This paper introduces a dynamic re-weighting training method for large language models that reduces reasoning output length by nearly 40% without sacrificing accuracy, improving inference efficiency.

Contribution

The authors propose a novel ratio-based training pipeline that balances reasoning data to eliminate redundancy without complex annotations or multiple models.

Findings

01. Reduces output tokens by nearly 40%
02. Maintains reasoning accuracy
03. Validated on multiple models and benchmarks

Abstract

Large Language Models (LLMs) have recently achieved remarkable progress by leveraging Reinforcement Learning and extended Chain-of-Thought (CoT) techniques. However, the challenge of performing efficient language reasoning, especially during inference with extremely long outputs, has drawn increasing attention from the research community. In this work, we propose a dynamic ratio-based training pipeline that does not rely on sophisticated data annotations or interpolation between multiple models. We continuously balance the weights between the model's System-1 and System-2 data to eliminate redundant reasoning processes while preserving the model's reasoning capability. We validate our approach on DeepSeek-R1-Distill-7B and DeepSeek-R1-Distill-14B across a diverse set of benchmarks with varying difficulty levels. Our method significantly reduces the number of output tokens…
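As a rough illustration of what continuously balancing System-1 and System-2 data could look like, here is a minimal Python sketch. It is not the authors' Algorithm 1: the function names (`update_short_ratio`, `mix_batch`), the step size, the tolerance, and the update rule itself are all assumptions made for illustration.

```python
import random


def update_short_ratio(short_ratio, accuracy, baseline_accuracy,
                       avg_tokens, target_tokens,
                       step=0.1, tolerance=0.01):
    """Return a new share of System-1 (short-CoT) samples in the training mix.

    Hypothetical rule: protect accuracy first, then push for shorter outputs.
    """
    if accuracy < baseline_accuracy - tolerance:
        # Accuracy regressed: give more weight to long System-2 traces.
        short_ratio -= step
    elif avg_tokens > target_tokens:
        # Accuracy held but outputs are still long: add more short traces.
        short_ratio += step
    # Clamp so the ratio stays a valid proportion.
    return min(1.0, max(0.0, short_ratio))


def mix_batch(system1, system2, short_ratio, batch_size, rng):
    """Draw a batch whose System-1 share is approximately short_ratio."""
    n_short = round(short_ratio * batch_size)
    batch = (rng.choices(system1, k=n_short)
             + rng.choices(system2, k=batch_size - n_short))
    rng.shuffle(batch)
    return batch
```

In a training loop under these assumptions, `update_short_ratio` would be called after each evaluation round and its result passed to `mix_batch` for the next round of SFT batches.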

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The data mixing idea to make models generate concise yet correct reasoning traces is new to me.
2. The experiments were done on multiple datasets and models.
3. The model performance is consistently similar or better.

Weaknesses

1. The paper needs to be proofread by a native English speaker, as there are some grammar issues. Example: "on reasoning LLMs, enabling the model to learn to generate more concise $\textbf{yet still}$ correct reasoning paths." (use "yet").
2. The authors did not compare against training-free reasoning compression methods ([1-2]).
3. Manually selecting easy-to-hard data and categorizing it as System-1 or System-2 adds labeling overhead to the SFT process.

[1] SEAL: Steerab

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- Compressing reasoning length for improved efficiency is an important and timely research direction, and the proposed method demonstrates strong empirical effectiveness.
- The paper includes extensive ablation studies, providing thorough and convincing evaluations.
- The proposed approach is clearly presented, easy to follow, and supported by released code, enhancing reproducibility.

Weaknesses

- I am somewhat unconvinced about the necessity of the dynamic re-weighting strategy in Algorithm 1. An ablation study comparing different re-weighting strategies would help clarify its contribution. For example, including simple baselines such as fixed curriculum ratios (large-to-small, small-to-large, or random re-weighting) could provide a clearer understanding of the proposed method’s effectiveness.
- It would be helpful to clarify the inference settings in the experimental setup. For insta
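The fixed-schedule baselines this reviewer suggests could be sketched roughly as follows. This is an illustrative assumption rather than anything taken from the paper: the ratio is taken to mean the share of short-CoT (System-1) data at a given training step, and the function names and the `low`/`high` bounds are hypothetical.

```python
import random


def large_to_small(step, total_steps, high=0.9, low=0.1):
    """Linearly decay the ratio from high to low over training."""
    frac = step / max(1, total_steps - 1)
    return high + (low - high) * frac


def small_to_large(step, total_steps, high=0.9, low=0.1):
    """Linearly grow the ratio from low to high over training."""
    frac = step / max(1, total_steps - 1)
    return low + (high - low) * frac


def random_ratio(rng, low=0.1, high=0.9):
    """Pick a ratio uniformly at random, ignoring training progress."""
    return rng.uniform(low, high)
```

Unlike a dynamic rule, none of these schedules look at validation accuracy or output length, which is what would make them useful controls in the ablation the reviewer asks for.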

Reviewer 03 · Rating 2 · Confidence 3

Strengths

* How to tune the length of reasoning is a well-known issue related to LLM performance. The paper addresses a key challenge in the area by proposing a new method.
* Empirically, the proposed method strikes a sweet balance between accuracy, inference reasoning length (or inference efficiency), and training efficiency (compared to RL).
* Concrete experiments include extensive ablation studies revealing multiple insights.

Weaknesses

* The presentation is unclear and lacks citations. Many terms are not well defined.
  - In Lines 40-41, the argument does not cite any prior work. It is not clear which work is the mainstream model merging that represents training-free methods.
  - In the paragraph starting from Line 73, the meaning of long CoT compression is not well defined. It is not clear if the proposed research is on training-based or training-free methods.
  - In Lines 86-87, there is no citation for clarifying which GSM

Reviewer 04 · Rating 4 · Confidence 3

Strengths

Clear Motivation & Strong Rationale: The paper is well-motivated by a practical problem (inference efficiency). The analysis in Section 2, which shows that naively mixing data fails, provides a strong and clear justification for the necessity of the proposed dynamic re-weighting approach.

Method Simplicity and Novelty: The TLDR method is elegant. By reformulating the compression problem as a dynamic data-weighting task solved with SFT, it avoids the high complexity and instability of reward-bas

Weaknesses

Limited Evaluation Domain: The experiments are conducted exclusively on mathematical reasoning datasets (GSM8K, MATH, AIME, etc.). It is unclear if this "System-1/System-2" data paradigm and the TLDR method will generalize to other reasoning domains, such as commonsense reasoning (e.g., HellaSwag), code generation, or long-form creative/factual writing.

Dependency on Curated "Hard" Data: The method's success seems to rely on the availability of a high-quality "difficult" dataset (like S1) to so

Code & Models

Repositories

zzli2022/tldr (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training