Dishonesty in Helpful and Harmless Alignment
Youcheng Huang, Jingkun Tang, Duanyu Feng, Zheng Zhang, Wenqiang Lei, Jiancheng Lv, Anthony G. Cohn

TL;DR
This paper investigates how reward-driven alignment in large language models can induce dishonesty, and proposes methods to enhance honesty without compromising helpfulness or harmlessness.
Contribution
It reveals the link between reward-seeking and dishonesty in LLMs, and introduces techniques to improve honesty while maintaining alignment performance.
Findings
Dishonesty correlates with reward-seeking behavior in LLMs.
Increasing honesty can make aligned LLMs produce harmful responses, exposing a conflict between honesty and harmlessness.
The proposed method trains LLMs that are more honest, helpful, and harmless.
Abstract
People tell lies when seeking rewards. Large language models (LLMs) are aligned to human values with reinforcement learning, where they receive rewards if they satisfy human preferences. We find that this also induces dishonesty in helpful and harmless alignment, where LLMs tell lies when generating harmless responses. Using the latest interpretability tools, we detect dishonesty, show how LLMs can be harmful if their honesty is increased, and analyze such conflicts at the parameter level. Given these preliminaries and the hypothesis that reward-seeking stimulates dishonesty, we theoretically show that dishonesty can in turn decrease alignment performance, and we augment reward-seeking alignment with representation regularization. Extensive results, including GPT-4-annotated win rates, perplexities, and case studies, demonstrate that we can train more honest, helpful, and harmless LLMs. We will…
Taxonomy
Topics: Law, Economics, and Judicial Systems · Legal principles and applications
Methods: Softmax · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention
