Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet
Berk Atil, Vipul Gupta, Sarkar Snigdha Sarathi Das, Rebecca J. Passonneau

TL;DR
This paper ranks smaller LLMs by their propensity to generate harmful content and evaluates how well larger LLMs can annotate that harmfulness, revealing that current models align only partially with human judgments.
Contribution
It provides an empirical assessment of how smaller LLMs rank in the harmfulness of their output and of how well large LLMs annotate harmfulness, highlighting gaps that remain for harm mitigation.
Findings
Smaller LLMs differ in how much harmful content they generate.
Large LLMs show low to moderate agreement with humans on harmfulness.
Further work is needed for effective harm mitigation.
Abstract
Large language models (LLMs) have become ubiquitous, so it is important to understand their risks and limitations. Smaller LLMs can be deployed where compute resources are constrained, such as edge devices, but they differ in their propensity to generate harmful output. Mitigation of LLM harm typically depends on annotating the harmfulness of LLM output, which is expensive to collect from humans. This work studies two questions: How do smaller LLMs rank regarding generation of harmful content? How well can larger LLMs annotate harmfulness? We prompt three small LLMs to elicit harmful content of various types, such as discriminatory language, offensive content, privacy invasion, or negative influence, and collect human rankings of their outputs. Then, we evaluate three state-of-the-art large LLMs on their ability to annotate the harmfulness of these responses. We find that the smaller models…
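The comparison between human rankings and LLM annotations described in the abstract can be illustrated with a rank-agreement score. The sketch below is an assumption, not the authors' method: this summary does not say which agreement metric the paper uses, so Kendall's tau is used here as one common choice. The function name `kendall_tau`, the placeholder model names, and the two example rankings are all hypothetical and carry no data from the paper.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two tie-free rankings of the same items.

    Each ranking is a list ordered from most to least harmful.
    Returns a value in [-1, 1]: 1 means identical order, -1 means reversed.
    """
    if len(set(rank_a)) != len(rank_a) or len(set(rank_b)) != len(rank_b):
        raise ValueError("rankings must not contain duplicates")
    if set(rank_a) != set(rank_b):
        raise ValueError("rankings must cover the same items")
    if len(rank_a) < 2:
        raise ValueError("need at least two items to compare")

    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}

    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # A pair is concordant when both rankings order x and y the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1

    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical rankings of three small models for one prompt (illustrative only).
human_ranking = ["model_a", "model_b", "model_c"]
llm_ranking = ["model_a", "model_c", "model_b"]

tau = kendall_tau(human_ranking, llm_ranking)
print(f"Kendall's tau: {tau:.3f}")
```

With only three items per ranking, tau can take just four values (-1, -1/3, 1/3, 1), so in practice a score like this would be averaged over many prompts before drawing any conclusion about agreement.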
Taxonomy
Topics: Law, AI, and Intellectual Property
