Improving Weak-to-Strong Generalization with Reliability-Aware Alignment
Yue Guo, Yi Yang

TL;DR
This paper introduces a reliability-aware alignment method for large language models that improves their ability to generalize from imperfect supervision signals by estimating and utilizing the reliability of weak labels.
Contribution
The paper proposes a novel approach that incorporates answer reliability estimation into the alignment process to enhance weak-to-strong generalization in LLMs.
Findings
Effective identification of weak label quality
Significant improvement in generalization performance
Enhanced robustness to noisy supervision
Abstract
Large language models (LLMs) are now rapidly advancing and surpassing human abilities on many natural language tasks. However, aligning these super-human LLMs with human knowledge remains challenging because the supervision signals from human annotators may be wrong. This issue, known as the "super-alignment" problem, requires enhancing weak-to-strong generalization, where a strong LLM must generalize from imperfect supervision provided by a weaker source. To address this issue, we propose an approach to improve weak-to-strong generalization by involving the reliability of weak supervision signals in the alignment process. In our method, we query the weak supervisor for multiple answers, estimate the answer reliability, and enhance the alignment process by filtering out uncertain data or re-weighting reliable data. Experiments on four datasets demonstrate that our methods effectively…
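The pipeline described in the abstract (query the weak supervisor several times, estimate answer reliability, then filter or re-weight) can be illustrated with a minimal sketch. The abstract does not say how reliability is estimated, so this sketch assumes a simple stand-in: the fraction of sampled answers that agree with the majority answer. The function names, the 0.6 threshold, and the data layout are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def estimate_reliability(answers):
    """Return (majority_answer, agreement_fraction) for one question.

    Assumption: agreement among repeated weak-supervisor answers is used
    as a proxy for reliability; the paper's actual estimator may differ.
    """
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def filter_uncertain(samples, threshold=0.6):
    """Keep only questions whose majority answer reaches the threshold.

    `threshold` is a hypothetical hyperparameter chosen for illustration.
    """
    kept = []
    for question, answers in samples:
        label, reliability = estimate_reliability(answers)
        if reliability >= threshold:
            kept.append((question, label))
    return kept


def reweight(samples):
    """Keep every question, attaching the reliability as a training weight."""
    weighted = []
    for question, answers in samples:
        label, reliability = estimate_reliability(answers)
        weighted.append((question, label, reliability))
    return weighted


# Toy data: each question paired with four answers sampled from a weak supervisor.
samples = [
    ("q1", ["A", "A", "A", "B"]),  # mostly consistent answers
    ("q2", ["A", "B", "C", "D"]),  # no agreement at all
]

filtered = filter_uncertain(samples)
weights = reweight(samples)
```

In this sketch, filtering discards low-agreement questions entirely, while re-weighting keeps all of them and lets a downstream loss scale each example by its weight. Which option works better in practice is an empirical question that the truncated abstract does not settle.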
Taxonomy
Topics: Neural Networks and Applications · Fault Detection and Control Systems · Anomaly Detection Techniques and Applications
