Bayesian WeakS-to-Strong from Text Classification to Generation
Ziyun Cui, Ziyang Zhang, Guangzhi Sun, Wen Wu, Chao Zhang

TL;DR
This paper introduces WeakS-to-Strong, a Bayesian ensemble approach that enhances weak supervision for large language models, extending from classification to generation tasks and improving model reliability and alignment.
Contribution
It extends Weak-to-Strong to WeakS-to-Strong by incorporating Bayesian ensemble methods and applies it to both text classification and generation, advancing supervision strategies.
Findings
Effective in improving model reliability and superalignment.
Successful extension from classification to generation tasks.
Bayesian confidence scores guide better supervision.
Abstract
Advances in large language models raise the question of how alignment techniques will adapt as models become increasingly complex and humans will only be able to supervise them weakly. Weak-to-Strong mimics such a scenario where weak model supervision attempts to harness the full capabilities of a much stronger model. This work extends Weak-to-Strong to WeakS-to-Strong by exploring an ensemble of weak models which simulate the variability in human opinions. Confidence scores are estimated using a Bayesian approach to guide the WeakS-to-Strong generalization. Furthermore, we extend the application of WeakS-to-Strong from text classification tasks to text generation tasks where more advanced strategies are investigated for supervision. Moreover, direct preference optimization is applied to advance the student model's preference learning, beyond the basic learning framework of teacher…
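As a rough illustration of how an ensemble of weak models might yield both a soft label and a confidence score, here is a minimal sketch. It is not the paper's method: the symmetric Dirichlet prior, the function name `dirichlet_ensemble`, the `prior_strength` parameter, and the choice of the largest posterior-mean probability as the confidence are all assumptions made for illustration.

```python
import numpy as np

def dirichlet_ensemble(weak_probs, prior_strength=1.0):
    """Combine soft labels from several weak models for one example.

    weak_probs: array-like of shape (n_models, n_classes); each row sums to 1.
    prior_strength: hypothetical symmetric Dirichlet concentration per class.
    Returns (soft_label, confidence).
    """
    weak_probs = np.asarray(weak_probs, dtype=float)
    # Treat each weak model's prediction as one fractional vote per class.
    alpha = prior_strength + weak_probs.sum(axis=0)
    # The posterior mean of the Dirichlet serves as the aggregated soft label.
    soft_label = alpha / alpha.sum()
    # Assumed confidence proxy: probability mass on the most likely class.
    confidence = float(soft_label.max())
    return soft_label, confidence

# Three weak teachers that agree, then three that disagree (made-up numbers).
agree = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15]]
disagree = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

soft_agree, conf_agree = dirichlet_ensemble(agree)
soft_disagree, conf_disagree = dirichlet_ensemble(disagree)
print(soft_agree, conf_agree)
print(soft_disagree, conf_disagree)
```

Under these assumptions, agreement among the weak teachers yields a sharper soft label and a higher confidence than disagreement, which is the kind of signal a student model could use to weight its supervision.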
Peer Reviews
Decision: ICLR 2025 Poster
**(S1) Originality and Significance** Studying the WeakS-to-Strong setting is well-motivated and using the proposed Bayesian method here is novel. Furthermore, the expansion from binary verification tasks to generation is notable since the latter more realistically reflects LLM use cases. **(S2) Quality** The paper’s proposed method results in notable improvements on the studied tasks, outperforming the individual weak model teachers in isolation and the naive baseline that simply does a we
**(W1) Significance of some results.** I found that certain claims in the manuscript were not as strongly supported by the empirical results. For instance, it appears that the improvements obtained from the DPO-based stage weren't too big (i.e., within standard deviations across runs). Another case was the claim in L461 about the auxiliary loss improving Bayesian methods. That being said, this doesn't affect my opinion about what I consider to be the main result, i.e., the superiority of Bayesia
**Originality**: The paper is the first to successfully extend weak-to-strong to the weaks-to-strong setting. The paper also contains several useful ideas for performing sequence-level training using soft-labels when the student and the teacher have different tokenizers. **Clarity**: The paper is written well and is easy to follow. **Quality**: The experiments are well structured to answer the central questions posed by the paper, i.e., is weaks-to-strong better than weak-to-strong, and is the
1. Some of the design choices are not well justified. For example, for the sequence-level training using soft labels, it is not clear why the strategy proposed in section 4.1 should work or is optimal.
2. The datasets selected for the classification and generation tasks seem very specific. Is there a particular reason for selecting only these? The impact of the paper could be much better if more general datasets were included, at least for the generation task.
- The proposal is simple but works.
- The motivation for using Dirichlet distribution to model the prior is ambiguous.
- The writing for the methodology description is unclear and hard to follow.
- The transformations in Equation 3 should be clarified.
- What is the definition of $p_k$ in Equation 4?
- In Section 4.1, the probability of the target token of the strong model changes after each update of network parameters because $C_s(s_1)$ and $C_s(s_2)$ shift. This means that the targets vary along the training, likely creating in
Taxonomy
Topics: Text and Document Classification Technologies
