The Superalignment of Superhuman Intelligence with Large Language Models

Minlie Huang; Yingkang Wang; Shiyao Cui; Pei Ke; Jie Tang

arXiv:2412.11145 · cs.CL · December 24, 2024

TL;DR

This position paper explores the concept of superalignment for superhuman AI models and proposes a scalable learning framework built from attacker, learner, and critic modules, with the aim of keeping such models safe and aligned with human values.

Contribution

It introduces a novel conceptual framework for superalignment involving attacker, learner, and critic modules, addressing key challenges in scalable alignment of superhuman models.
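
To make the three roles concrete, here is a minimal toy sketch of how an attacker, a learner, and a critic might interact in a loop. It is an illustration under strong assumptions, not the paper's method: the class names, the `align` function, the string labels, and the rule-based critic are all hypothetical, and the critic here is a perfect oracle, which is exactly what is unavailable when the model is stronger than its human supervisors.

```python
from dataclasses import dataclass, field


@dataclass
class Learner:
    """Hypothetical stand-in for the model being aligned."""
    refusals: set = field(default_factory=set)  # prompts it has learned to refuse

    def respond(self, prompt: str) -> str:
        return "refuse" if prompt in self.refusals else "comply"

    def update(self, prompt: str, feedback: str) -> None:
        # Learn from critic feedback: start refusing prompts flagged as unsafe.
        if feedback == "unsafe":
            self.refusals.add(prompt)


class Attacker:
    """Hypothetical stand-in for adversarial prompt generation."""

    def __init__(self, candidate_prompts):
        self.candidate_prompts = list(candidate_prompts)

    def propose(self, learner: Learner):
        # Toy adversarial search: probe with every prompt the learner still complies with.
        return [p for p in self.candidate_prompts if learner.respond(p) == "comply"]


class Critic:
    """Hypothetical oracle critic (an assumption; real critics are noisy)."""

    def __init__(self, harmful_prompts):
        self.harmful_prompts = set(harmful_prompts)

    def judge(self, prompt: str, response: str) -> str:
        if prompt in self.harmful_prompts and response == "comply":
            return "unsafe"
        return "ok"


def align(learner, attacker, critic, max_rounds=5):
    """Run attack -> respond -> critique -> update rounds; return unsafe counts per round."""
    history = []
    for _ in range(max_rounds):
        flagged = 0
        for prompt in attacker.propose(learner):
            feedback = critic.judge(prompt, learner.respond(prompt))
            learner.update(prompt, feedback)
            if feedback == "unsafe":
                flagged += 1
        history.append(flagged)
        if flagged == 0:  # the attacker found nothing the critic objects to
            break
    return history


prompts = ["how to bake bread", "harmful request A", "harmful request B"]
learner = Learner()
history = align(learner, Attacker(prompts), Critic(prompts[1:]))
print(history)
```

In this toy setup the loop stops once a full round of attacks produces no flagged response. Replacing the oracle critic with a noisy or weaker one is where the scalability questions raised by the paper would begin.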

Findings

01. Framework for superalignment with three modules

02. Identification of key research problems in each module

03. Discussion of future directions like emergent risks

Abstract

We have witnessed superhuman intelligence thanks to the fast development of large language models and multimodal language models. As such superhuman models are applied more and more widely, a critical question arises: how can we ensure that superhuman models remain safe, reliable, and well aligned with human values? In this position paper, we discuss the concept of superalignment from the learning perspective to answer this question, outlining the learning paradigm shift from large-scale pretraining and supervised fine-tuning to alignment training. We define superalignment as designing effective and efficient alignment algorithms that learn from noisy-labeled data (point-wise samples or pair-wise preference data) in a scalable way when the task becomes very complex for human experts to annotate and the model is stronger than human experts. We highlight some key research…

Taxonomy

Topics: Computational Physics and Python Applications