A Course Correction in Steerability Evaluation: Revealing Miscalibration and Side Effects in LLMs
Trenton Chang, Tobias Schnabel, Adith Swaminathan, Jenna Wiens

TL;DR
This paper introduces a framework for evaluating LLM steerability across multiple text attributes. It shows that current models and alignment methods often induce unintended side effects, underscoring the need for better evaluation and alignment strategies.
Contribution
The paper presents a multi-dimensional goal-space framework for assessing LLM steerability. It uses the framework to expose behavioral shifts and side effects in open-ended generation, and it releases an open-source evaluation tool.
Findings
Current LLMs cause unintended text attribute changes.
Existing alignment methods have limited effectiveness.
Side effects in LLM outputs remain a significant challenge.
Abstract
Despite advances in large language models (LLMs) on reasoning and instruction-following tasks, it is unclear whether they can reliably produce outputs aligned with a variety of user goals, a concept called steerability. Two gaps in current LLM evaluation impede steerability evaluation: (1) many benchmarks are built with past LLM chats and Internet-scraped text, which may skew towards common requests, and (2) scalar measures of performance common in prior work could conceal behavioral shifts in LLM outputs in open-ended generation. Thus, we introduce a framework based on a multi-dimensional goal-space that models user goals and LLM outputs as vectors with dimensions corresponding to text attributes (e.g., reading difficulty). Applied to a text-rewriting task, we find that current LLMs induce unintended changes or side effects to text attributes, impeding steerability. Interventions to…
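The abstract models user goals and LLM outputs as vectors in a shared space of text attributes. The sketch below shows one way a side effect could be quantified in such a space. It is a minimal illustration, not the paper's definition, and it assumes each text has already been scored as a vector of attributes. The function name `decompose_steering` and the example attribute scores are hypothetical. The sketch projects the change the model actually made onto the change the user requested. The projection coefficient shows whether the model undershot (below 1) or overshot (above 1) the goal, which is one possible reading of miscalibration. The orthogonal remainder measures movement along attributes the user did not ask to change, which is one possible reading of a side effect.

```python
import math
from typing import Sequence, Tuple


def _sub(a: Sequence[float], b: Sequence[float]) -> list:
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def decompose_steering(
    source: Sequence[float],
    goal: Sequence[float],
    output: Sequence[float],
) -> Tuple[float, float]:
    """Hypothetical metric: split observed attribute change into on-target and off-target parts.

    Returns (along, side_effect):
      along       -- projection coefficient of the achieved change onto the requested
                     change (1.0 = exactly the requested amount, <1 undershoot, >1 overshoot)
      side_effect -- Euclidean norm of the achieved change orthogonal to the request
    """
    if not (len(source) == len(goal) == len(output)):
        raise ValueError("source, goal, and output must have the same dimensions")
    requested = _sub(goal, source)   # movement the user asked for
    achieved = _sub(output, source)  # movement the model actually produced
    norm_sq = _dot(requested, requested)
    if norm_sq == 0.0:
        # No change was requested, so any movement counts as off-target.
        return 0.0, math.sqrt(_dot(achieved, achieved))
    along = _dot(achieved, requested) / norm_sq
    # Remove the component parallel to the request; what remains is off-target.
    orthogonal = [a - along * r for a, r in zip(achieved, requested)]
    return along, math.sqrt(_dot(orthogonal, orthogonal))


# Hypothetical attributes on a 0-to-1 scale: [reading difficulty, formality, length].
# The user asks for easier text and no other change. The model overshoots the
# simplification and also makes the text more formal.
along, side = decompose_steering([0.5, 0.5, 0.5], [0.2, 0.5, 0.5], [0.1, 0.7, 0.5])
print(f"along={along:.2f}, side_effect={side:.2f}")  # along≈1.33, side_effect≈0.20
```

A single scalar score, such as the distance from the output to the goal, would merge these two failure modes. Keeping them separate reflects the abstract's point that scalar measures can hide behavioral shifts.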
