CALM: Curiosity-Driven Auditing for Large Language Models

Xiang Zheng; Longxiang Wang; Yi Liu; Xingjun Ma; Chao Shen; Cong Wang

arXiv:2501.02997·cs.AI·January 7, 2025


TL;DR

CALM introduces a reinforcement learning-based method for auditing black-box large language models by automatically discovering inputs that trigger harmful or biased outputs, addressing challenges of limited access and large search space.

Contribution

This paper presents CALM, a novel curiosity-driven auditing framework that uses reinforcement learning to effectively identify unsafe behaviors in black-box LLMs.

Findings

01. Successfully identified derogatory outputs involving celebrities.

02. Uncovered inputs that elicit politically sensitive responses.

03. Demonstrated effectiveness in a black-box setting.

Abstract

Auditing Large Language Models (LLMs) is a crucial and challenging task. In this study, we focus on auditing black-box LLMs without access to their parameters, only to the provided service. We treat this type of auditing as a black-box optimization problem where the goal is to automatically uncover input-output pairs of the target LLMs that exhibit illegal, immoral, or unsafe behaviors. For instance, we may seek a non-toxic input that the target LLM answers with a toxic output, or an input that induces a hallucinated response from the target LLM mentioning politically sensitive individuals. This black-box optimization is challenging due to the scarcity of feasible points, the discrete nature of the prompt space, and the large search space. To address these challenges, we propose Curiosity-Driven Auditing for Large Language Models (CALM), which uses intrinsically motivated…
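The black-box formulation above can be illustrated with a toy search loop. This is only a minimal sketch under stated assumptions, not the CALM implementation: the abstract is truncated before it describes the actual intrinsic reward, so a simple count-based novelty bonus is swapped in here. The names `toy_target`, `extrinsic_reward`, `audit`, the vocabulary, and all hyperparameters are hypothetical. The loop only queries the target and observes its response, which mirrors the black-box constraint.

```python
import math
import random
from collections import Counter

# Hypothetical toy vocabulary; a real prompt space is far larger.
VOCAB = ["tell", "me", "about", "the", "weather", "secret", "recipe", "plan"]

def toy_target(prompt):
    """Hypothetical stand-in for a black-box LLM service.

    Only one rare token combination triggers the flagged response,
    mimicking the scarcity of feasible points.
    """
    if "secret" in prompt and "plan" in prompt:
        return "UNSAFE"
    return "ok"

def extrinsic_reward(response):
    """Hypothetical audit objective: 1.0 when the response is flagged."""
    return 1.0 if response == "UNSAFE" else 0.0

def audit(target, steps=1000, length=3, beta=0.5, lr=0.1, seed=0):
    """Sample prompts, query the target, and learn per-token preferences.

    The reward is the audit objective plus a count-based novelty bonus
    (an assumption of this sketch) that decays as a prompt is revisited.
    """
    rng = random.Random(seed)
    value = {tok: 0.0 for tok in VOCAB}  # running reward estimate per token
    visits = Counter()                   # how often each prompt was tried
    found = set()                        # prompts that triggered the flag
    for _ in range(steps):
        # Softmax-style sampling: higher-valued tokens are drawn more often.
        weights = [math.exp(value[tok]) for tok in VOCAB]
        prompt = tuple(rng.choices(VOCAB, weights=weights, k=length))
        response = target(prompt)        # the only access to the target
        visits[prompt] += 1
        bonus = beta / math.sqrt(visits[prompt])
        reward = extrinsic_reward(response) + bonus
        for tok in set(prompt):
            value[tok] += lr * (reward - value[tok])
        if response == "UNSAFE":
            found.add(prompt)
    return found, visits
```

Calling `audit(toy_target)` returns the set of flagged prompts and the visit counts. The novelty bonus keeps the sampler trying unseen prompts while no flagged output has been found yet, which is the role an intrinsic reward plays when the extrinsic signal is sparse.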


Code & Models

Repositories

x-zheng16/calm (PyTorch, official)

Videos

CALM: Curiosity-Driven Auditing for Large Language Models

Taxonomy

Topics: Expert finding and Q&A systems · Recommender Systems and Techniques · Multi-Agent Systems and Negotiation
