Alignment for Honesty

Yuqing Yang; Ethan Chern; Xipeng Qiu; Graham Neubig; Pengfei Liu

arXiv:2312.07000 · cs.CL · October 29, 2024 · 2 cites

TL;DR

This paper argues for aligning large language models with honesty. It proposes metrics, benchmarks, and training methods so that models decline to answer when they lack the relevant knowledge, without becoming overly conservative.

Contribution

It introduces a formal definition of honesty for LLMs, develops metrics and benchmarks to measure it, and presents a flexible fine-tuning framework to enhance honesty.

Findings

01 Aligned models show increased honesty according to the proposed metrics.

02 The training framework maintains performance on other tasks.

03 Open-source resources facilitate future research.

Abstract

Recent research has made significant strides in aligning large language models (LLMs) with helpfulness and harmlessness. In this paper, we argue for the importance of alignment for honesty, ensuring that LLMs proactively refuse to answer questions when they lack knowledge, while still not being overly conservative. However, a pivotal aspect of alignment for honesty involves discerning an LLM's knowledge boundaries, which demands comprehensive solutions in terms of metric development, benchmark creation, and training methodologies. We address these challenges by first establishing a precise problem definition and defining "honesty" inspired by the Analects of Confucius. This serves as a cornerstone for developing metrics that effectively measure an LLM's honesty by quantifying its progress post-alignment. Furthermore, we introduce a flexible training framework which is further…

Code & Models

Repositories

gair-nlp/alignment-for-honesty (Official)

Videos

Alignment for Honesty · SlidesLive

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI · Software Engineering Research