ConTrans: Weak-to-Strong Alignment Engineering via Concept Transplantation

Weilong Dong; Xinwei Wu; Renren Jin; Shaoyang Xu; Deyi Xiong

arXiv:2405.13578 · cs.CL · December 31, 2024

TL;DR

ConTrans transfers aligned concepts from smaller, weak yet aligned language models to larger, unaligned base models, reducing the computational cost of alignment training.

Contribution

It proposes a concept transplantation framework that refines and reformulates concept vectors for effective alignment transfer across different LLMs.

Findings

01. Successful transplantation of aligned concepts across various LLM sizes and families.

02. ConTrans surpasses instruction-tuned models in truthfulness.

03. Effective weak-to-strong alignment transfer demonstrated.

Abstract

Ensuring that large language models (LLMs) behave consistently with human goals, values, and intentions is crucial for their safety, yet computationally expensive. To reduce the computational cost of alignment training of LLMs, especially for those with a huge number of parameters, and to reutilize learned value alignment, we propose ConTrans, a novel framework that enables weak-to-strong alignment transfer via concept transplantation. From the perspective of representation engineering, ConTrans first refines concept vectors in value alignment from a source LLM (usually a weak yet aligned LLM). The refined concept vectors are then reformulated to adapt to the target LLM (usually a strong yet unaligned base LLM) via affine transformation. In the third step, ConTrans transplants the reformulated concept vectors into the residual stream of the target LLM. Experiments demonstrate the successful…
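The three steps named in the abstract (refine, reformulate, transplant) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the mean-difference refinement, the least-squares fit of the affine map from paired hidden states, the choice to apply only the linear part of that map to a direction, the additive injection, and every function and parameter name (`refine_concept_vector`, `fit_affine_map`, `reformulate`, `transplant`, `alpha`) are assumptions made for illustration.

```python
import numpy as np


def refine_concept_vector(pos_hidden, neg_hidden):
    """Step 1 (assumed form): unit-norm mean difference between source-model
    hidden states for positive and negative examples of a concept."""
    v = pos_hidden.mean(axis=0) - neg_hidden.mean(axis=0)
    return v / np.linalg.norm(v)


def fit_affine_map(src_hidden, tgt_hidden):
    """Least-squares fit of tgt ~ src @ W + b from paired hidden states
    (assumed way of obtaining the affine transformation)."""
    ones = np.ones((src_hidden.shape[0], 1))
    design = np.hstack([src_hidden, ones])  # last column carries the bias
    coef, *_ = np.linalg.lstsq(design, tgt_hidden, rcond=None)
    return coef[:-1], coef[-1]  # W: (d_src, d_tgt), b: (d_tgt,)


def reformulate(v_src, W, b):
    """Step 2: map the source concept vector into the target space.
    Assumption: a concept vector is a difference of two states, so the
    bias b cancels and only the linear part W is applied."""
    v_tgt = v_src @ W
    return v_tgt / np.linalg.norm(v_tgt)


def transplant(residual, v_tgt, alpha=1.0):
    """Step 3 (assumed form): add the scaled concept vector to every
    position of a residual-stream activation of shape (seq_len, d_tgt)."""
    return residual + alpha * v_tgt


# Toy usage with random arrays standing in for real model activations.
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 4))            # source hidden states, d_src = 4
tgt = src @ rng.normal(size=(4, 6)) + 1.0  # paired target states, d_tgt = 6
W, b = fit_affine_map(src, tgt)
v_src = refine_concept_vector(src[:25] + 1.0, src[25:])
v_tgt = reformulate(v_src, W, b)
steered = transplant(rng.normal(size=(3, 6)), v_tgt, alpha=2.0)
```

In a real setting the arrays would come from hidden states captured at a chosen layer of each model, and `alpha` would control how strongly the concept is injected; which layer and strength work well is something the abstract does not specify.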

Code & Models

Repositories

willowdong/contrans (PyTorch, official)

Taxonomy

Topics: DNA and Biological Computing · Machine Learning and Algorithms

Methods: Balanced Selection