LLMBoost: Make Large Language Models Stronger with Boosting
Zehao Chen, Tianxiang Ai, Yifei Li, Gongxun Li, Yuyang Wei, Wang Zhou, Guanghui Li, Bin Yu, Zhijun Chen, Hailong Sun, Fuzhen Zhuang, Jianxin Li, Deqing Wang, Yikun Ban

TL;DR
LLMBoost introduces a novel ensemble fine-tuning framework for LLMs that leverages intermediate states, hierarchical error correction, and near-parallel inference to improve accuracy and efficiency.
Contribution
It presents a new boosting-inspired method that explicitly utilizes internal model states and a chain training paradigm for enhanced ensemble learning of LLMs.
Findings
Consistently improves accuracy on reasoning tasks
Reduces inference latency significantly
Theoretically guarantees monotonic improvements
Abstract
Ensemble learning of LLMs has emerged as a promising alternative to enhance performance, but existing approaches typically treat models as black boxes, combining the inputs or final outputs while overlooking the rich internal representations and interactions across models. In this work, we introduce LLMBoost, a novel ensemble fine-tuning framework that breaks this barrier by explicitly leveraging intermediate states of LLMs. Inspired by the boosting paradigm, LLMBoost incorporates three key innovations. First, a cross-model attention mechanism enables successor models to access and fuse hidden states from predecessors, facilitating hierarchical error correction and knowledge transfer. Second, a chain training paradigm progressively fine-tunes connected models with an error-suppression objective, ensuring that each model rectifies the mispredictions of its predecessor with minimal…
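To make the cross-model attention idea in the abstract concrete, here is a minimal NumPy sketch of one way a successor model could attend to a predecessor's hidden states. It is an illustration under stated assumptions, not the authors' implementation: the function name `cross_model_attention`, the single attention head, the causal mask, the residual connection, and the scalar `gate` are all hypothetical choices that the abstract does not specify.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_model_attention(h_succ, h_pred, w_q, w_k, w_v, gate=0.5):
    """Fuse predecessor hidden states into successor hidden states.

    h_succ, h_pred: (T, d) hidden states of the successor / predecessor
    at the same layer (assumes both models share the hidden size d).
    w_q, w_k, w_v: (d, d) projection matrices (hypothetical parameters).
    gate: scalar weight on the fused signal (an assumption of this sketch).
    """
    q = h_succ @ w_q          # queries come from the successor
    k = h_pred @ w_k          # keys come from the predecessor
    v = h_pred @ w_v          # values come from the predecessor
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Assumed causal mask: position t only sees predecessor positions <= t.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    attn = softmax(scores, axis=-1)
    # Residual fusion: the successor keeps its own state and adds the
    # attended predecessor information, scaled by the gate.
    fused = h_succ + gate * (attn @ v)
    return fused, attn

rng = np.random.default_rng(0)
T, d = 4, 8
h_successor = rng.normal(size=(T, d))
h_predecessor = rng.normal(size=(T, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
fused, attn = cross_model_attention(h_successor, h_predecessor, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

With `gate=0.0` the successor's states pass through unchanged, which is a convenient sanity check that the fusion path is purely additive in this sketch.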
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper is well-written and easy to follow. 2. The idea of chain training paradigm is novel, interesting and straightforward.
1. Given that LLMBoost significantly increases computational cost and GPU memory usage during both training and inference, the performance gains appear relatively modest. 2. LLMBoost introduces a range of hyperparameters, which makes training and inference more complex, and for some of them (e.g., $\beta$ in Equation 6), the paper does not discuss how to set them. 3. The Error-Suppression Objective seems a bit strange to me. It maximizes the probability margin between the ground-truth token and
1. The idea of boosting by incorporating previous model correction seems to be relatively novel in the realm of LLMs with cross-model attention fusion. However, I have some concerns about the evaluation discussed in W1. 2. The results indicate that the ensembled model improves performance over the baseline approach by a decent margin on commonsense reasoning and arithmetic reasoning benchmarks.
1. Unfair capacity comparison. The ensembles (e.g., 3×8B = 24B total parameters) are not compared against a single 24B model or an equivalently fast system (perhaps a 16B model is faster than the 3-model ensemble with a better performance?), so the true efficiency–accuracy trade-off is unclear. I recommend that the authors include such results and clearly explain the setting for comparison. 2. The evaluated benchmarks in this work include arithmetic and commonsense reasoning. In order to really
- **Idea:** Ensembling *inside* the network (state sharing) instead of only outputs is interesting. - **Design details:** Clear components (cross-model attention, error-token forwarding, top-k backward fusion) with concrete formulas and a training algorithm. - **Theory:** Gives a clean MSE-based guarantee (with assumptions) and interpretable role of the scaling $ \lambda $. - **Empirics:** Broad tasks, multiple sizes (3B–8B, Qwen/Llama); monotonic gains as $n$ increases (diminishing returns beyo
- **Representation compatibility & alignment.** Requires homogeneous model families (matching layer shapes/hidden formats); no cross-family alignment is provided. Current scope does not address heterogeneous LLMs—limiting generality. - **Efficiency & complexity.** Training/inference require multiple full LLMs with state exchange; results compare sequential vs near-parallel *within* LLMBOOST, but provide no wall-clock/token-cost comparison vs external baselines. Memory/throughput implications (fo
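The reviews refer to an MSE-based guarantee, a scaling factor $\lambda$, and monotonic gains as the number of chained models grows. The sketch below illustrates the generic boosting intuition behind such statements using classic residual fitting with squared loss. It is not the paper's theorem or training procedure: the least-squares "models", the function names `fit_linear` and `boosted_chain`, and the parameter `lam` (standing in for $\lambda$) are assumptions made for illustration only.

```python
import numpy as np

def fit_linear(x, target):
    # Least-squares fit of `target` on the columns of `x`.
    coef, *_ = np.linalg.lstsq(x, target, rcond=None)
    return coef

def boosted_chain(features, y, lam=0.5):
    """Chain of residual-fitting learners (generic boosting, squared loss).

    features: list of (N, d_i) design matrices, one per chained learner.
    y: (N,) float regression targets.
    lam: scaling of each learner's correction (assumed to lie in [0, 2]).
    Returns the final prediction and the training MSE after each stage.
    """
    pred = np.zeros_like(y)
    mses = [float(np.mean((y - pred) ** 2))]
    for x in features:
        residual = y - pred                  # errors left by predecessors
        coef = fit_linear(x, residual)       # successor fits those errors
        pred = pred + lam * (x @ coef)       # scaled correction
        mses.append(float(np.mean((y - pred) ** 2)))
    return pred, mses

rng = np.random.default_rng(0)
n = 200
y = rng.normal(size=n)
# Each learner sees a different noisy view of the target (toy data).
views = [np.column_stack([y + rng.normal(size=n), np.ones(n)]) for _ in range(4)]
_, mse_per_stage = boosted_chain(views, y, lam=0.5)
print(mse_per_stage)
```

Because each least-squares fit is an orthogonal projection of the current residual, the training MSE cannot increase at any stage for `lam` in [0, 2]. The toy example shows only this training-set property and says nothing about held-out accuracy.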
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
