Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

Yang Liu; Yuanshun Yao; Jean-Francois Ton; Xiaoying Zhang; Ruocheng Guo; Hao Cheng; Yegor Klochkov; Muhammad Faaiz Taufiq; and Hang Li

arXiv:2308.05374·cs.AI·March 22, 2024·69 cites

Open Access · 1 Repo

TL;DR

This paper provides a comprehensive survey of key dimensions for evaluating the trustworthiness of large language models, including reliability, safety, fairness, and social norm adherence, along with measurement studies on popular LLMs.

Contribution

It introduces a detailed framework for assessing LLM alignment across multiple trustworthiness categories and presents empirical measurement results to guide practitioners.

Findings

01. Aligned models generally perform better in trustworthiness

02. Effectiveness of alignment varies across categories

03. Fine-grained analysis is essential for improvement

Abstract

Ensuring alignment, which refers to making models behave in accordance with human intentions [1,2], has become a critical task before deploying large language models (LLMs) in real-world applications. For instance, OpenAI devoted six months to iteratively aligning GPT-4 before its release [3]. However, a major challenge faced by practitioners is the lack of clear guidance on evaluating whether LLM outputs align with social norms, values, and regulations. This obstacle hinders systematic iteration and deployment of LLMs. To address this issue, this paper presents a comprehensive survey of key dimensions that are crucial to consider when assessing LLM trustworthiness. The survey covers seven major categories of LLM trustworthiness: reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness. Each major category is further…

Code & Models

Repositories

kevinyaobytedance/llm_eval (PyTorch, official)

Taxonomy

Topics: Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Absolute Position Encodings · Label Smoothing · Layer Normalization · Adam · Residual Connection · Dense Connections