STEER-BENCH: A Benchmark for Evaluating the Steerability of Large Language Models

Kai Chen; Zihao He; Taiwei Shi; Kristina Lerman

arXiv:2505.20645·cs.CL·June 5, 2025

TL;DR

Steer-Bench is a benchmark that evaluates how well large language models adapt their outputs to diverse community norms. It reveals significant gaps between current models and human-level alignment.

Contribution

The paper introduces Steer-Bench, a benchmark of over 10,000 instruction-response pairs and 5,500 validated multiple-choice questions for systematically assessing LLMs' community-specific steerability across 19 domains.

Findings

01

Human experts achieve 81% accuracy against the silver labels.

02

The best-performing models reach only around 65% accuracy, well below the 81% achieved by human experts.

03

Current LLMs show significant gaps in community-sensitive steerability.

Abstract

Steerability, or the ability of large language models (LLMs) to adapt outputs to align with diverse community-specific norms, perspectives, and communication styles, is critical for real-world applications but remains under-evaluated. We introduce Steer-Bench, a benchmark for assessing population-specific steering using contrasting Reddit communities. Covering 30 contrasting subreddit pairs across 19 domains, Steer-Bench includes over 10,000 instruction-response pairs and 5,500 validated multiple-choice questions with corresponding silver labels to test alignment with diverse community norms. Our evaluation of 13 popular LLMs using Steer-Bench reveals that while human experts achieve an accuracy of 81% with silver labels, the best-performing models reach only around 65% accuracy depending on the domain and configuration. Some models lag behind human-level alignment by over 15 percentage…
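The accuracy figures in the abstract can be read as the share of multiple-choice answers that match the silver labels. The following is a minimal sketch of that calculation, not the authors' evaluation code: the function names `accuracy` and `gap_in_points` and the toy answer lists are assumptions made for illustration, and only the 81% and roughly 65% figures come from the abstract.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], silver_labels: Sequence[str]) -> float:
    """Return the fraction of answers that match the silver labels."""
    if len(predictions) != len(silver_labels):
        raise ValueError("predictions and silver_labels must have the same length")
    if not silver_labels:
        raise ValueError("at least one question is required")
    correct = sum(p == s for p, s in zip(predictions, silver_labels))
    return correct / len(silver_labels)


def gap_in_points(human_acc: float, model_acc: float) -> float:
    """Return the human-minus-model accuracy gap in percentage points."""
    return (human_acc - model_acc) * 100


# Toy data invented for illustration; these are not Steer-Bench items.
silver = ["A", "C", "B", "D"]
model_answers = ["A", "C", "D", "D"]
toy_accuracy = accuracy(model_answers, silver)  # 3 of 4 answers match

# Figures reported in the abstract: 81% (human experts), about 65% (best models).
reported_gap = gap_in_points(0.81, 0.65)
```

With the abstract's figures, the gap comes to about 16 percentage points, which is consistent with the statement that some models trail human-level alignment by more than 15 points.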

Taxonomy

Topics: Computational and Text Analysis Methods · Topic Modeling · Hate Speech and Cyberbullying Detection