BeHonest: Benchmarking Honesty in Large Language Models

Steffi Chern; Zhulin Hu; Yuqing Yang; Ethan Chern; Yuan Guo; Jiahe Jin; Binjie Wang; Pengfei Liu

arXiv:2406.13261·cs.CL·July 10, 2024·1 cites

TL;DR

BeHonest introduces a comprehensive benchmark to evaluate honesty in large language models, addressing knowledge awareness, deceit avoidance, and response consistency, revealing significant room for improvement in current models.

Contribution

The paper presents the first dedicated honesty benchmark for LLMs, assessing multiple models across various honesty dimensions and providing a foundation for future improvements.

Findings

01. Current LLMs show considerable dishonesty and inconsistency.

02. The benchmark reveals significant gaps in honesty across popular models.

03. The paper encourages the AI community to focus on honesty alignment for safer models.

Abstract

Previous works on Large Language Models (LLMs) have mainly focused on evaluating their helpfulness or harmlessness. However, honesty, another crucial alignment criterion, has received relatively less attention. Dishonest behaviors in LLMs, such as spreading misinformation and defrauding users, present severe risks that intensify as these models approach superintelligent levels. Enhancing honesty in LLMs addresses critical limitations and helps uncover latent capabilities that are not readily expressed. This underscores the urgent need for reliable methods and benchmarks to effectively ensure and evaluate the honesty of LLMs. In this paper, we introduce BeHonest, a pioneering benchmark specifically designed to assess honesty in LLMs comprehensively. BeHonest evaluates three essential aspects of honesty: awareness of knowledge boundaries, avoidance of deceit, and consistency in…

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01·Rating 6·Confidence 4

Strengths

• This paper investigates an important quality of alignment and proposes three principles for defining honesty in LLMs.
• This paper proposes 10 scenarios to benchmark the honesty of LLMs, evaluating honesty comprehensively.
• The experiments are thorough, and the proposed framework is useful for future work extending to more datasets and LLMs.

Weaknesses

• The evaluation prompt format restricts LLMs’ answers to “Yes” or “No”, which limits the interpretability of their responses.
• More discussion of insights from the experiments would be useful. I would suggest summarizing the key insights from the experimental results for readability.

Reviewer 02·Rating 6·Confidence 4

Strengths

1. Comprehensive Benchmark Design: BEHONEST integrates 3 dimensions of honesty, i.e., self-knowledge, non-deceptiveness, and consistency.
2. Diverse Scenario Evaluation: The benchmark collects evaluation data for 10 diverse scenarios, enabling thorough testing of LLM honesty.
3. Thorough experiments: The paper conducts experiments on nine LLMs, including both open-weight and proprietary models.

Weaknesses

1. In Lines 43-44, the definition of honesty seems a bit too absolute; it appears to reflect the authors’ perspective rather than an established truth. It is recommended to add phrases like ‘In this paper’, since the definition of honesty is still evolving [1,2].
2. The evaluation section only presents the results; deeper analysis and discussion could contribute more to the community.

[1] Alignment for Honesty
[2] A Survey on the Honesty of Large Language Models

Reviewer 03·Rating 5·Confidence 4

Strengths

- The principle is clear, structured, relatively reasonable, and comprehensive. Its clear definitions of honesty-related aspects cover nearly every honesty problem that LLMs could face.
- A comprehensive benchmark covers a broad range of scenarios, allowing for an in-depth assessment of LLM honesty.
- Results provide actionable insights for further research and development.
- The complete code of the work is given to facilitate reproduction.

Weaknesses

- **Inconsistency vs. Dishonesty:** The paper lacks clarity on whether inconsistencies stem from model architecture limitations or dishonest behavior. While the paper notes that inconsistencies do not always equate to dishonesty, this distinction is not clearly addressed in the benchmark. Moreover, inconsistency might not strictly relate to honesty; it could more fittingly be categorized as model bias or robustness.
- **Game Scenario:** Using a social game, such as Werewolf, to assess honesty ma

Code & Models

Repositories

gair-nlp/behonest
Official

Datasets

GAIR/BeHonest
dataset·105 dl

Taxonomy

Topics: Adversarial Robustness in Machine Learning