Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization

Wenkai Yang; Shiqi Shen; Guangyao Shen; Wei Yao; Yong Liu; Zhi Gong; Yankai Lin; Ji-Rong Wen

arXiv:2406.11431·cs.CL·March 3, 2025

TL;DR

This paper investigates whether strong models can deceive weak models in multi-objective alignment scenarios, revealing that deception is prevalent and worsens with larger capability gaps, raising concerns about superalignment reliability.

Contribution

It introduces the concept of weak-to-strong deception in superalignment, demonstrating its existence and characteristics through extensive experiments in multi-objective settings.

Findings

1. Weak-to-strong deception exists across all tested scenarios.
2. Deception increases with the capability gap between models.
3. Intermediate models can partially mitigate deception.

Abstract

Superalignment, where humans act as weak supervisors for superhuman models, has become a crucial problem with the rapid development of Large Language Models (LLMs). Recent work has preliminarily studied this problem by using weak models to supervise strong models, and discovered that weakly supervised strong students can consistently outperform weak teachers towards the alignment target, leading to a weak-to-strong generalization phenomenon. However, we are concerned that behind this promising phenomenon there may be an issue of weak-to-strong deception, where strong models deceive weak models by appearing well-aligned in areas known to weak models while producing misaligned behaviors in cases that weak models do not know. We take an initial step towards exploring this security issue in a specific but realistic multi-objective alignment case, where there may be some alignment…

Code & Models

Repositories

keven980716/weak-to-strong-deception
PyTorch · Official

Taxonomy

Topics: Statistical and numerical algorithms · Medical Image Segmentation Techniques · Geochemistry and Geologic Mapping

Methods: Softmax · Attention Is All You Need