Argument-Based Consistency in Toxicity Explanations of LLMs
Ramaravind Kommiya Mothilal, Joanna Roy, Syed Ishtiaque Ahmed, Shion Guha

TL;DR
This paper introduces Argument-based Consistency (ArC), a new evaluation framework for assessing the logical coherence of LLMs' toxicity explanations, revealing their reasoning limitations on complex prompts.
Contribution
It proposes a theoretically-grounded, multi-dimensional criterion (ArC) and six metrics to evaluate LLMs' toxicity explanation consistency, addressing limitations of existing methods.
Findings
LLMs generate plausible explanations for simple prompts.
Reasoning about toxicity becomes inconsistent with nuanced prompts.
Code and explanations are open-sourced for future research.
Abstract
The discourse around toxicity and LLMs in NLP largely revolves around detection tasks. This work shifts the focus to evaluating LLMs' reasoning about toxicity - from their explanations that justify a stance - to enhance their trustworthiness in downstream tasks. Despite extensive research on explainability, it is not straightforward to adopt existing methods to evaluate free-form toxicity explanation due to their over-reliance on input text perturbations, among other challenges. To account for these, we propose a novel, theoretically-grounded multi-dimensional criterion, Argument-based Consistency (ArC), that measures the extent to which LLMs' free-form toxicity explanations reflect an ideal and logical argumentation process. Based on uncertainty quantification, we develop six metrics for ArC to comprehensively evaluate the (in)consistencies in LLMs' toxicity explanations. We conduct…
Taxonomy
Topics: Biomedical Ethics and Regulation · Ethics in Clinical Research · Ethics in medical practice
