The Superalignment of Superhuman Intelligence with Large Language Models
Minlie Huang, Yingkang Wang, Shiyao Cui, Pei Ke, Jie Tang

TL;DR
This position paper explores superalignment for superhuman AI models and proposes a scalable learning framework built from attacker, learner, and critic modules, with the aim of keeping such models safe and aligned with human values.
Contribution
It introduces a novel conceptual framework for superalignment involving attacker, learner, and critic modules, addressing key challenges in scalable alignment of superhuman models.
Findings
Framework for superalignment with three modules
Identification of key research problems in each module
Discussion of future directions like emergent risks
Abstract
We have witnessed superhuman intelligence thanks to the fast development of large language models and multimodal language models. As the application of such superhuman models becomes more and more popular, a critical question arises here: how can we ensure superhuman models are still safe, reliable and aligned well to human values? In this position paper, we discuss the concept of superalignment from the learning perspective to answer this question by outlining the learning paradigm shift from large-scale pretraining, supervised fine-tuning, to alignment training. We define superalignment as designing effective and efficient alignment algorithms to learn from noisy-labeled data (point-wise samples or pair-wise preference data) in a scalable way when the task becomes very complex for human experts to annotate and the model is stronger than human experts. We highlight some key research…
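To make the phrase "learn from noisy-labeled ... pair-wise preference data" concrete, here is a minimal sketch of one common way to handle noisy preference labels: a Bradley-Terry log-likelihood with label smoothing. This is an illustration under stated assumptions, not the method proposed in the paper. The function name `noisy_preference_loss` and the parameter `flip_prob` (an assumed probability that an annotator's preference label is flipped) are hypothetical.

```python
import math


def _log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def noisy_preference_loss(score_chosen: float,
                          score_rejected: float,
                          flip_prob: float = 0.1) -> float:
    """Label-smoothed Bradley-Terry loss for one preference pair.

    score_chosen / score_rejected are scalar model scores for the response
    the annotator preferred and the one they rejected. flip_prob is an
    ASSUMED label-noise rate (hypothetical parameter, not from the paper):
    with that probability the annotated preference is treated as reversed.
    """
    if not 0.0 <= flip_prob < 0.5:
        raise ValueError("flip_prob must be in [0, 0.5)")
    margin = score_chosen - score_rejected
    # Mix the likelihood of the label as given with the likelihood of the
    # flipped label, weighted by the assumed noise rate.
    return -((1.0 - flip_prob) * _log_sigmoid(margin)
             + flip_prob * _log_sigmoid(-margin))


if __name__ == "__main__":
    # With no assumed noise, a large margin drives the loss toward zero.
    print(noisy_preference_loss(10.0, 0.0, flip_prob=0.0))
    # With assumed noise, the same margin is penalized as overconfident.
    print(noisy_preference_loss(10.0, 0.0, flip_prob=0.1))
```

With `flip_prob = 0` this reduces to the standard Bradley-Terry pairwise loss. With a positive `flip_prob`, the loss is minimized at a finite score margin rather than an unbounded one, which keeps a model from fitting possibly wrong labels with full confidence.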
Taxonomy
Topics: Computational Physics and Python Applications
