DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Boxin Wang; Weixin Chen; Hengzhi Pei; Chulin Xie; Mintong Kang; Chenhui Zhang; Chejian Xu; Zidi Xiong; Ritik Dutta; Rylan Schaeffer; Sang T. Truong; Simran Arora; Mantas Mazeika; Dan Hendrycks; Zinan Lin; Yu Cheng; Sanmi Koyejo; Dawn Song; Bo Li

arXiv:2306.11698·cs.CL·February 28, 2024·61 cites

Open Access 4 Datasets 1 Video

TL;DR

This paper conducts a comprehensive evaluation of GPT models' trustworthiness, revealing vulnerabilities like bias, toxicity, and privacy leaks, and compares GPT-4 and GPT-3.5 across multiple trustworthiness dimensions.

Contribution

It introduces a detailed trustworthiness benchmark for GPT models, highlighting previously unknown vulnerabilities and differences between GPT-4 and GPT-3.5.

Findings

01

GPT models can be misled to generate toxic and biased outputs.

02

GPT-4 is generally more trustworthy than GPT-3.5 but more vulnerable to jailbreaking prompts.

03

GPT-4 leaks private information more easily when prompted maliciously.

Abstract

Generative Pre-trained Transformer (GPT) models have exhibited exciting progress in their capabilities, capturing the interest of practitioners and the public alike. Yet, while the literature on the trustworthiness of GPT models remains limited, practitioners have proposed employing capable GPT models for sensitive applications such as healthcare and finance -- where mistakes can be costly. To this end, this work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5, considering diverse perspectives -- including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness. Based on our evaluations, we discover previously unpublished vulnerabilities to trustworthiness threats. For instance, we find that GPT models can be easily…

Videos

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models · slideslive

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI

Methods: Attention Is All You Need · Discriminative Fine-Tuning · Cosine Annealing · Weight Decay · Dense Connections · Dropout · Byte Pair Encoding