Cannot or Should Not? Automatic Analysis of Refusal Composition in IFT/RLHF Datasets and Refusal Behavior of Black-Box LLMs

Alexander von Recum; Christoph Schnabl; Gabor Hollbeck; Silas Alberti; Philip Blinde; Marvin von Hagen

arXiv:2412.16974·cs.AI·December 24, 2024

TL;DR

This paper introduces a detailed taxonomy and datasets for classifying refusals in large language models, enabling better auditing and understanding of refusal behaviors to improve safety and reliability.

Contribution

It presents a comprehensive refusal taxonomy, annotated datasets, synthetic data, and classifiers for analyzing refusal behaviors in LLMs and training datasets.

Findings

01 Developed a taxonomy of 16 refusal categories

02 Created a human-annotated dataset of over 8,600 refusal instances

03 Trained classifiers for refusal detection

Abstract

Refusals - instances where large language models (LLMs) decline or fail to fully execute user instructions - are crucial for both AI safety and AI capabilities and the reduction of hallucinations in particular. These behaviors are learned during post-training, especially in instruction fine-tuning (IFT) and reinforcement learning from human feedback (RLHF). However, existing taxonomies and evaluation datasets for refusals are inadequate, often focusing solely on should-not-related (instead of cannot-related) categories, and lacking tools for auditing refusal content in black-box LLM outputs. We present a comprehensive framework for classifying LLM refusals: (a) a taxonomy of 16 refusal categories, (b) a human-annotated dataset of over 8,600 instances from publicly available IFT and RLHF datasets, (c) a synthetic dataset with 8,000 examples for each refusal category, and (d)…

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Software Engineering Research