LLMBoost: Make Large Language Models Stronger with Boosting

Zehao Chen; Tianxiang Ai; Yifei Li; Gongxun Li; Yuyang Wei; Wang Zhou; Guanghui Li; Bin Yu; Zhijun Chen; Hailong Sun; Fuzhen Zhuang; Jianxin Li; Deqing Wang; Yikun Ban

arXiv:2512.22309·cs.LG·December 30, 2025

Open Access 3 Reviews

TL;DR

LLMBoost introduces a novel ensemble fine-tuning framework for LLMs that leverages intermediate states, hierarchical error correction, and near-parallel inference to improve accuracy and efficiency.

Contribution

It presents a new boosting-inspired method that explicitly utilizes internal model states and a chain training paradigm for enhanced ensemble learning of LLMs.

Findings

01

Consistently improves accuracy on reasoning tasks

02

Reduces inference latency significantly

03

Theoretically guarantees monotonic improvements

Abstract

Ensemble learning of LLMs has emerged as a promising alternative to enhance performance, but existing approaches typically treat models as black boxes, combining the inputs or final outputs while overlooking the rich internal representations and interactions across models. In this work, we introduce LLMBoost, a novel ensemble fine-tuning framework that breaks this barrier by explicitly leveraging intermediate states of LLMs. Inspired by the boosting paradigm, LLMBoost incorporates three key innovations. First, a cross-model attention mechanism enables successor models to access and fuse hidden states from predecessors, facilitating hierarchical error correction and knowledge transfer. Second, a chain training paradigm progressively fine-tunes connected models with an error-suppression objective, ensuring that each model rectifies the mispredictions of its predecessor with minimal…
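
The abstract describes the cross-model attention mechanism only at a high level. As a rough illustration, the minimal sketch below shows one way a successor model's hidden states could attend over a predecessor's hidden states and blend in the result. It is not the paper's formulation: the function name `cross_model_attention`, the fixed `gate` mixing weight, the identity query/key/value projections, and the absence of causal masking are all assumptions made for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_model_attention(successor_states, predecessor_states, gate=0.5):
    """Blend predecessor hidden states into successor hidden states.

    Both arguments are lists of vectors (one per token position) that
    share the same width. `gate` is a hypothetical fixed mixing weight;
    a real system would likely learn projections and gating instead.
    """
    dim = len(successor_states[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in successor_states:
        # Scaled dot-product scores of this successor position against
        # every predecessor position (no causal mask in this sketch).
        weights = softmax([dot(query, key) * scale for key in predecessor_states])
        # Attention-weighted average of the predecessor states.
        context = [
            sum(w * value[d] for w, value in zip(weights, predecessor_states))
            for d in range(dim)
        ]
        # Convex combination of the successor state and the fused context.
        fused.append([(1 - gate) * q + gate * c for q, c in zip(query, context)])
    return fused
```

With `gate=0.0` the successor's states pass through unchanged, which makes the predecessor's contribution easy to isolate when experimenting with the sketch.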

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The paper is well-written and easy to follow.
2. The idea of the chain training paradigm is novel, interesting, and straightforward.

Weaknesses

1. Given that LLMBoost significantly increases computational cost and GPU memory usage during both training and inference, the performance gains appear relatively modest.
2. LLMBoost introduces a range of hyperparameters, which makes training and inference more complex, and for some of them (e.g., $\beta$ in Equation 6), the paper does not discuss how to set them.
3. The Error-Suppression Objective seems a bit strange to me. It maximizes the probability margin between the ground-truth token and

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The idea of boosting by incorporating previous model correction seems to be relatively novel in the realm of LLMs with cross-model attention fusion. However, I have some concerns about the evaluation discussed in W1.
2. The results indicate that the ensembled model improves performance over the baseline approach by a decent margin on commonsense reasoning and arithmetic reasoning benchmarks.

Weaknesses

1. Unfair capacity comparison. The ensembles (e.g., 3×8B = 24B total parameters) are not compared against a single 24B model or an equivalently fast system (perhaps a 16B model is faster than the 3-model ensemble with a better performance?), so the true efficiency–accuracy trade-off is unclear. I recommend that the authors include such results and clearly explain the setting for comparison.
2. The evaluated benchmarks in this work include arithmetic and commonsense reasoning. In order to really

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- **Idea:** Ensembling *inside* the network (state sharing) instead of only outputs is interesting.
- **Design details:** Clear components (cross-model attention, error-token forwarding, top-k backward fusion) with concrete formulas and a training algorithm.
- **Theory:** Gives a clean MSE-based guarantee (with assumptions) and interpretable role of the scaling $\lambda$.
- **Empirics:** Broad tasks, multiple sizes (3B–8B, Qwen/Llama); monotonic gains as $n$ increases (diminishing returns beyo

Weaknesses

- **Representation compatibility & alignment.** Requires homogeneous model families (matching layer shapes/hidden formats); no cross-family alignment is provided. Current scope does not address heterogeneous LLMs—limiting generality.
- **Efficiency & complexity.** Training/inference require multiple full LLMs with state exchange; results compare sequential vs near-parallel *within* LLMBOOST, but provide no wall-clock/token-cost comparison vs external baselines. Memory/throughput implications (fo
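
Reviewer 03 mentions an MSE-based guarantee and a scaling $\lambda$, but the theorem itself is not reproduced on this page. The sketch below illustrates only the generic least-squares boosting fact that such guarantees commonly build on: if each stage's output is added with a step equal to a fraction (strictly between 0 and 2) of the optimal least-squares step, the training MSE cannot increase. The names `boost`, `stage_outputs`, and `shrinkage` are hypothetical, and `shrinkage` is at best a loose analogue of the paper's $\lambda$.

```python
def mse(residual):
    """Mean squared error of a residual vector."""
    return sum(r * r for r in residual) / len(residual)


def boost(targets, stage_outputs, shrinkage=0.5):
    """Additively combine stage outputs and track MSE after each stage.

    targets:       list of floats to approximate.
    stage_outputs: list of vectors, one per stage (hypothetical stand-ins
                   for what each successive model contributes).
    shrinkage:     fraction of the optimal least-squares step to take;
                   any value in (0, 2) keeps the MSE non-increasing.
    """
    prediction = [0.0] * len(targets)
    history = [mse([t - p for t, p in zip(targets, prediction)])]
    for h in stage_outputs:
        residual = [t - p for t, p in zip(targets, prediction)]
        norm = sum(x * x for x in h)
        if norm == 0.0:
            step = 0.0  # a zero stage output cannot change the prediction
        else:
            # Optimal step is <residual, h> / <h, h>; scale it by shrinkage.
            step = shrinkage * sum(r * x for r, x in zip(residual, h)) / norm
        prediction = [p + step * x for p, x in zip(prediction, h)]
        history.append(mse([t - p for t, p in zip(targets, prediction)]))
    return history
```

The monotonicity follows from expanding the squared residual: with optimal step $s^*$ and shrinkage $\eta$, the squared error drops by $(2\eta - \eta^2)\langle r, h\rangle^2 / \langle h, h\rangle$, which is non-negative for $0 < \eta < 2$.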

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques