Debate Helps Weak-to-Strong Generalization
Hao Lang, Fei Huang, Yongbin Li

TL;DR
This paper explores how debate between models can improve weak supervision when human supervision is limited: a strong model first helps a weak supervisor, and the improved weak supervisor is then used to supervise the strong model, with the aim of better aligning it.
Contribution
The paper introduces an approach that combines debate with weak supervision to improve model alignment, and it reports empirical improvements on NLP benchmarks.
Findings
Debate helps weak models extract trustworthy information from strong models.
An ensemble of weak models exploits long arguments for more robust supervision.
Combining debate with weak supervision improves alignment results.
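The ensemble finding can be illustrated with a minimal sketch. It assumes each weak model outputs per-example class probabilities and that the ensemble simply averages them; this averaging rule and the name `ensemble_weak_labels` are hypothetical choices for illustration, not the aggregation method described in the paper.

```python
from typing import List, Sequence, Tuple


def ensemble_weak_labels(
    prob_sets: Sequence[Sequence[Sequence[float]]],
) -> List[Tuple[int, float]]:
    """Average class probabilities across weak models (assumed aggregation rule).

    prob_sets[m][i][c] is the probability that weak model m assigns to
    class c for example i. Returns one (label, mean_confidence) per example.
    """
    if not prob_sets:
        raise ValueError("need at least one weak model")
    n_models = len(prob_sets)
    n_examples = len(prob_sets[0])
    labels: List[Tuple[int, float]] = []
    for i in range(n_examples):
        n_classes = len(prob_sets[0][i])
        # Mean probability per class over all weak models.
        mean = [
            sum(model[i][c] for model in prob_sets) / n_models
            for c in range(n_classes)
        ]
        # The weak label is the class with the highest mean probability.
        label = max(range(n_classes), key=lambda c: mean[c])
        labels.append((label, mean[label]))
    return labels


if __name__ == "__main__":
    # Three hypothetical weak models, two examples, two classes.
    probs = [
        [[0.9, 0.1], [0.4, 0.6]],
        [[0.7, 0.3], [0.2, 0.8]],
        [[0.2, 0.8], [0.3, 0.7]],
    ]
    for label, conf in ensemble_weak_labels(probs):
        print(label, round(conf, 3))
```

In a weak-to-strong setup, the resulting labels would serve as the training targets for the strong model, and the mean confidence could be used to filter or weight examples.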
Abstract
Common methods for aligning already-capable models with desired behavior rely on the ability of humans to provide supervision. However, future superhuman models will surpass the capability of humans. Therefore, humans will only be able to weakly supervise superhuman models. This expected deficiency of human evaluation would weaken the safety of future AI systems. Scalable oversight and weak-to-strong generalization are two complementary approaches to tackle this issue. In this paper, we attempt to combine the strengths of these two approaches to further improve alignment. Specifically, we investigate ways of improving human supervision with a strong pretrained model and then supervise the strong model with enhanced weak human supervision. To make iterative empirical progress, we consider an analogy: can we use a strong model to improve weak model supervision and then use it to supervise…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Ethics and Social Impacts of AI
