Your Weak LLM is Secretly a Strong Teacher for Alignment
Leitian Tao, Yixuan Li

TL;DR
This paper investigates how weak, resource-efficient LLMs can serve as effective and scalable teachers for alignment, with feedback that can match or exceed human feedback in quality at lower cost.
Contribution
It demonstrates that weak LLMs can generate high-quality feedback for alignment, offering a scalable alternative to human annotation, and that the size of the feedback model has only a limited effect on feedback quality.
Findings
Weak LLMs can produce feedback comparable to human annotations.
Model size has limited effect on feedback quality.
Weak LLM feedback can outperform human feedback in some cases.
Abstract
The burgeoning capabilities of large language models (LLMs) have underscored the need for alignment to ensure these models act in accordance with human values and intentions. Existing alignment frameworks present constraints either in the form of expensive human effort or high computational costs. This paper explores a promising middle ground, where we employ a weak LLM that is significantly less resource-intensive than top-tier models, yet offers more automation than purely human feedback. We present a systematic study to evaluate and understand weak LLM's ability to generate feedback for alignment. Our empirical findings demonstrate that weak LLMs can provide feedback that rivals or even exceeds that of fully human-annotated data. Our study indicates a minimized impact of model size on feedback efficacy, shedding light on a scalable and sustainable alignment strategy. To deepen our…
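The idea of using a weak model's feedback in place of human annotation can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: `label_pairs`, `agreement_rate`, and `PreferencePair` are hypothetical helpers, and `toy_score` is a trivial placeholder standing in for a weak LLM's scalar feedback. The sketch only shows the general shape of the workflow: score two candidate responses, keep the higher-scored one as "chosen", and measure how often that choice matches a human label.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def label_pairs(
    samples: List[Tuple[str, str, str]],
    score: Callable[[str, str], float],
) -> List[PreferencePair]:
    """Turn (prompt, response_a, response_b) triples into preference pairs.

    `score` is a hypothetical interface standing in for a weak LLM's
    scalar feedback on a (prompt, response) pair. Ties go to response_a.
    """
    pairs = []
    for prompt, a, b in samples:
        if score(prompt, a) >= score(prompt, b):
            pairs.append(PreferencePair(prompt, a, b))
        else:
            pairs.append(PreferencePair(prompt, b, a))
    return pairs


def agreement_rate(pairs: List[PreferencePair], human_chosen: List[str]) -> float:
    """Fraction of pairs whose chosen response matches the human-chosen one."""
    matches = sum(p.chosen == h for p, h in zip(pairs, human_chosen))
    return matches / len(pairs)


def toy_score(prompt: str, response: str) -> float:
    # Placeholder only, NOT a real model: counts response words that
    # also appear in the prompt. A real setup would query a weak LLM.
    prompt_words = set(prompt.lower().split())
    return float(sum(w in prompt_words for w in response.lower().split()))


samples = [
    ("how to boil an egg", "boil the egg for ten minutes", "I do not know"),
]
pairs = label_pairs(samples, toy_score)
rate = agreement_rate(pairs, ["boil the egg for ten minutes"])
```

In practice the resulting pairs would feed a preference-based alignment method, and the agreement rate gives one rough way to compare weak-model labels against human ones.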
Peer Reviews
Decision: ICLR 2025 Poster
- The research topic in this paper is critical and popular. Given the high costs associated with human and large model annotations, exploring the potential of weaker LLMs in providing alignment feedback is a timely and necessary endeavor.
- This paper is well-written and easy to follow, with an introduction to associated alignment background techniques.
- A dedicated discussion section highlights the differences between this work and the most relevant prior studies.
- More comprehensive experiments are required to validate the reliability of the primary conclusions in this paper, including but not limited to the following aspects.
  + The most relevant baselines with the idea of RLAIF. Beyond the fact that the original RLAIF methods rely on large LLMs to provide feedback, it is crucial to delineate differences between the proposed training pipeline and RLAIF. Additionally, a comparison with RLAIF employing weaker LLMs as the reward model would strengthen the findi
1. This paper proposes a cost-effective alignment framework that reduces computational costs and reliance on extensive human feedback, using weak LLMs that have significantly fewer parameters than high-capacity models.
2. The study demonstrates that weak LLMs can perform as well as, and in some cases better than, human annotators in generating preference feedback for alignment through extensive experiments. For example, by demonstrating similar alignment success across different model families
1. The paper lacks a thorough analysis of feedback consistency over repeated evaluations. For example, when weak LLM and human preferences conflict, the study does not adequately explore the stability of weak LLM feedback. A quantitative analysis of consistency across multiple runs or various weak models would be helpful to assess whether the weak LLM's feedback is repeatable and reliable.
2. The evaluation metrics are only based on 2 models, the gold reward model and GPT-4. If the models are
1. The paper introduces a framework that leverages weak LLMs for alignment, which offers a promising alternative to existing methods that rely on either extensive human labor or expensive computational resources.
2. The paper conducts a comprehensive evaluation of weak LLM feedback across various model scales and families. The results consistently demonstrate the effectiveness of this approach, highlighting its potential for practical application.
3. The paper is well-organized, with clear sections
1. In this paper, the authors propose a framework that utilizes weak LLMs for alignment, striving to strike a balance between RLHF and RLAIF. It appears that the weak-to-strong approach and methodology in the paper are still quite similar to Burns et al. (2024). Burns et al. (2024)'s study focuses on reward modeling and binary classification, where the outputs in this paper are either scalar or categorical labels. Your study is closely tied to alignment and involves a more complex output space. How
Taxonomy
Topics: Digital Rights Management and Security · Translation Studies and Practices
