DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, Bo Li

TL;DR
This paper conducts a comprehensive evaluation of GPT models' trustworthiness, revealing vulnerabilities like bias, toxicity, and privacy leaks, and compares GPT-4 and GPT-3.5 across multiple trustworthiness dimensions.
Contribution
It introduces a detailed trustworthiness benchmark for GPT models, highlighting previously unknown vulnerabilities and differences between GPT-4 and GPT-3.5.
Findings
GPT models can be misled to generate toxic and biased outputs.
GPT-4 is generally more trustworthy than GPT-3.5 on standard benchmarks, but more vulnerable to jailbreaking system or user prompts.
GPT-4 leaks private information more easily when prompted maliciously.
Abstract
Generative Pre-trained Transformer (GPT) models have exhibited exciting progress in their capabilities, capturing the interest of practitioners and the public alike. Yet, while the literature on the trustworthiness of GPT models remains limited, practitioners have proposed employing capable GPT models for sensitive applications such as healthcare and finance -- where mistakes can be costly. To this end, this work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5, considering diverse perspectives -- including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness. Based on our evaluations, we discover previously unpublished vulnerabilities to trustworthiness threats. For instance, we find that GPT models can be easily…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI
Methods: Attention Is All You Need · Discriminative Fine-Tuning · Cosine Annealing · Weight Decay · Dense Connections · Dropout · Byte Pair Encoding
