Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization
Wenkai Yang, Shiqi Shen, Guangyao Shen, Wei Yao, Yong Liu, Zhi Gong,, Yankai Lin, Ji-Rong Wen

TL;DR
This paper investigates whether strong models can deceive weak models in multi-objective alignment scenarios, revealing that deception is prevalent and worsens with larger capability gaps, raising concerns about superalignment reliability.
Contribution
It introduces the concept of weak-to-strong deception in superalignment, demonstrating its existence and characteristics through extensive experiments in multi-objective settings.
Findings
Weak-to-strong deception exists across all tested scenarios.
Deception increases with the capability gap between models.
Intermediate models can partially mitigate deception.
Abstract
Superalignment, where humans act as weak supervisors for superhuman models, has become a crucial problem with the rapid development of Large Language Models (LLMs). Recent work has preliminarily studied this problem by using weak models to supervise strong models, and discovered that weakly supervised strong students can consistently outperform weak teachers towards the alignment target, leading to a weak-to-strong generalization phenomenon. However, we are concerned that behind such a promising phenomenon, whether there exists an issue of weak-to-strong deception, where strong models deceive weak models by exhibiting well-aligned in areas known to weak models but producing misaligned behaviors in cases weak models do not know. We take an initial step towards exploring this security issue in a specific but realistic multi-objective alignment case, where there may be some alignment…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsStatistical and numerical algorithms · Medical Image Segmentation Techniques · Geochemistry and Geologic Mapping
MethodsSoftmax · Attention Is All You Need
