"Oops, Did I Just Say That?" Testing and Repairing Unethical Suggestions of Large Language Models with Suggest-Critique-Reflect Process
Pingchuan Ma, Zongjie Li, Ao Sun, Shuai Wang

TL;DR
This paper presents ETHICSSUITE, a comprehensive testing framework, and a suggest-critic-reflect process for automatically detecting and repairing unethical suggestions made by large language models in real-time.
Contribution
It introduces the first automated testing and repair framework for unethical LLM suggestions, including a novel OTF repair scheme applicable to black-box APIs.
Findings
Uncovered 109,824 unethical suggestions across seven LLMs.
OTF repair scheme effectively repairs a significant portion of unethical outputs.
Demonstrated real-time repair capability with Llama-13B and ChatGPT.
Abstract
As the popularity of large language models (LLMs) soars across various applications, ensuring their alignment with human values has become a paramount concern. In particular, given that LLMs have great potential to serve as general-purpose AI assistants in daily life, their subtly unethical suggestions become a serious and real concern. Tackling the challenge of automatically testing and repairing unethical suggestions is thus demanding. This paper introduces the first framework for testing and repairing unethical suggestions made by LLMs. We first propose ETHICSSUITE, a test suite that presents complex, contextualized, and realistic moral scenarios to test LLMs. We then propose a novel suggest-critic-reflect (SCR) process, serving as an automated test oracle to detect unethical suggestions. We recast deciding if LLMs yield unethical suggestions (a hard problem; often requiring human…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)
MethodsRepair · Test
