Evaluating Steering Techniques using Human Similarity Judgments
Zach Studdiford, Timothy T. Rogers, Siddharth Suresh, Kushin Mukherjee

TL;DR
This paper uses a triadic similarity judgment task to evaluate how well different LLM steering techniques align with human cognition. It finds that prompt-based methods perform best and that model representations are biased towards 'kind' similarity over 'size' similarity.
Contribution
It introduces a human cognition-based evaluation method for LLM steering, demonstrating prompt-based steering's effectiveness and revealing biases in model similarity judgments.
Findings
Prompt-based steering outperforms other methods in both steering accuracy and model-to-human alignment.
LLMs show bias towards 'kind' similarity over 'size'.
The evaluation reveals privileged representational axes in LLMs.
Abstract
Current evaluations of Large Language Model (LLM) steering techniques focus on task-specific performance, overlooking how well steered representations align with human cognition. Using a well-established triadic similarity judgment task, we assessed steered LLMs on their ability to flexibly judge similarity between concepts based on size or kind. We found that prompt-based steering methods outperformed other methods both in terms of steering accuracy and model-to-human alignment. We also found LLMs were biased towards 'kind' similarity and struggled with 'size' alignment. This evaluation approach, grounded in human cognition, adds further support to the efficacy of prompt-based steering and reveals privileged representational axes in LLMs prior to steering.
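As a rough illustration of the task described in the abstract, the sketch below scores a steered judge on triplets where the 'size' and 'kind' answers disagree. The `Item` class, the item labels, and the `steering_accuracy` helper are hypothetical names introduced here for illustration; they are not the paper's stimuli or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    name: str
    size: str  # hypothetical coarse size label
    kind: str  # hypothetical category label


def target_choice(ref, option_a, option_b, context):
    """Return the option that matches the reference on the instructed dimension."""
    a_match = getattr(option_a, context) == getattr(ref, context)
    b_match = getattr(option_b, context) == getattr(ref, context)
    if a_match == b_match:
        raise ValueError("triplet is not diagnostic for this context")
    return option_a if a_match else option_b


def steering_accuracy(trials):
    """trials: iterable of (ref, option_a, option_b, context, chosen)."""
    trials = list(trials)
    hits = sum(target_choice(r, a, b, ctx) == chosen
               for r, a, b, ctx, chosen in trials)
    return hits / len(trials)


# Hypothetical items, not taken from the paper's stimulus set.
orange = Item("orange", size="small", kind="fruit")
watermelon = Item("watermelon", size="large", kind="fruit")
tennis_ball = Item("tennis ball", size="small", kind="ball")

trials = [
    (orange, watermelon, tennis_ball, "kind", watermelon),  # follows the instruction
    (orange, watermelon, tennis_ball, "size", watermelon),  # 'kind'-biased error
]
print(steering_accuracy(trials))  # 0.5
```

The second trial shows the kind of error the abstract reports: the judge keeps choosing by 'kind' even when asked to judge by 'size'.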
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
+ I like the question of evaluating whether steering interventions produce human-like representations.
+ The task design seems carefully constructed, and the implementation details are quite extensive.
- Accuracy appears to be defined against decisions induced by the fitted human embedding rather than a more direct measure, such as the majority human response on the same triplet. Is this circular?
- The experiments lack some important statistical details. Specifically, the results in Section 4 don't appear to specify the underlying model (is it logistic regression?), confidence intervals, how multiple comparisons are handled, etc.
- Experiment is focused only on two Gemma-2 m
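The alternative target mentioned in the first concern of this review, the majority human response on the same triplet, could be computed along the following lines. This is a minimal sketch under assumed data shapes (a dict from triplet id to a list of human choices); the function names are hypothetical and the paper's actual scoring may differ.

```python
from collections import Counter


def majority_choice(responses):
    """Return the most common human choice for one triplet, or None on a tie."""
    counts = Counter(responses).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def accuracy_vs_majority(human_responses, model_choices):
    """Score model choices only on triplets with a clear human majority."""
    scored = hits = 0
    for triplet_id, responses in human_responses.items():
        target = majority_choice(responses)
        if target is None or triplet_id not in model_choices:
            continue
        scored += 1
        hits += model_choices[triplet_id] == target
    return hits / scored if scored else float("nan")


# Hypothetical data: triplet id -> list of chosen options ("A" or "B").
human = {"t1": ["A", "A", "B"], "t2": ["B", "B", "B"], "t3": ["A", "B"]}
model = {"t1": "A", "t2": "A", "t3": "B"}
print(accuracy_vs_majority(human, model))  # 0.5
```

Tied triplets are skipped here rather than broken arbitrarily; that is one design choice among several.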
The study moves beyond task-specific accuracy to evaluate the representational alignment of steered LLMs with human cognition. It applies a triadic similarity judgment task to both humans and LLMs and compares the results. The use of the Round Things Dataset allows for a controlled investigation of how task context selectively emphasizes one dimension over the other. Comprehensive comparison of steering methods. Discovery of inherent LLM bias towards a specific axis over another via the inclu
The study relies exclusively on triadic similarity judgment. This is good for controlled isolation, but the results may not generalize to more complex applications. It would be interesting to see how larger models perform. Order effects: the authors mention that the experiments were run using only a single overall ordering. I understand the reasoning, but given that LLMs are known to suffer from ordering bias, it would have been better to at least report on results that included a mitigation
The experimental setup is coherent:
- define triplets with mutually exclusive size vs kind decisions
- collect at least 2,500 judgments per method
- fit 2D embeddings with the crowd-kernel loss, and
- compute squared Procrustes correlations between human-derived and model-derived embeddings.
The instruction formats, in-context learning vs zero-shot, and activation-based interventions are described clearly in the appendix with a consistent steering and evaluation pipeline. The separation of co
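The last step of the pipeline summarized in this review, a squared Procrustes correlation between two 2D embeddings, can be sketched in closed form. The sketch assumes that rows of both embeddings refer to the same concepts in the same order and that translation, uniform scaling, rotation, and reflection are all allowed; the review does not say which Procrustes variant the paper uses, and `procrustes_r2` is a hypothetical name.

```python
def _center(points):
    """Subtract the centroid from a list of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    return [(p[0] - mx, p[1] - my) for p in points]


def procrustes_r2(x, y):
    """Squared Procrustes correlation between two 2D configurations."""
    if len(x) != len(y):
        raise ValueError("configurations must have the same number of points")
    xc, yc = _center(x), _center(y)
    ssx = sum(a * a + b * b for a, b in xc)
    ssy = sum(a * a + b * b for a, b in yc)
    if ssx == 0 or ssy == 0:
        raise ValueError("a configuration has zero variance")
    # 2x2 cross-product matrix M = Xc^T Yc.
    m00 = sum(p[0] * q[0] for p, q in zip(xc, yc))
    m01 = sum(p[0] * q[1] for p, q in zip(xc, yc))
    m10 = sum(p[1] * q[0] for p, q in zip(xc, yc))
    m11 = sum(p[1] * q[1] for p, q in zip(xc, yc))
    frob2 = m00 ** 2 + m01 ** 2 + m10 ** 2 + m11 ** 2
    det = m00 * m11 - m01 * m10
    # For singular values s1, s2 of M: (s1 + s2)^2 = ||M||_F^2 + 2 * |det M|.
    return (frob2 + 2 * abs(det)) / (ssx * ssy)


# Hypothetical embeddings: y is x rotated by 90 degrees, scaled by 2, and shifted.
x = [(0, 0), (1, 0), (0, 2), (3, 1)]
y = [(-2 * b + 5, 2 * a - 1) for a, b in x]
print(round(procrustes_r2(x, y), 6))  # 1.0
```

A value of 1 means the two configurations agree up to the allowed transformations; lower values indicate structure that no rotation, reflection, or rescaling can reconcile.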
- The decision to fix the embedding dimensionality at two could artificially compress structure and differentially affect methods; it would be helpful to report results across multiple dimensionalities with model selection via held-out triplets or information criteria, or to show that conclusions are stable at d=3–5.
- Procrustes r^2 should be accompanied by uncertainty estimates, for example via bootstrapping triplets, and significance assessed against a permutation baseline that preserves trip
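The bootstrap suggested in this review could look roughly like the percentile bootstrap below, which resamples triplets with replacement and recomputes a statistic each time. For Procrustes r^2 the `statistic` callable would need to refit the embedding on each resample; here a simple agreement rate stands in for it. All names, defaults, and data are assumptions made for illustration.

```python
import random


def bootstrap_ci(triplets, statistic, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a statistic computed over triplets."""
    rng = random.Random(seed)
    n = len(triplets)
    stats = sorted(
        statistic([triplets[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


def agreement_rate(outcomes):
    """Mean of 0/1 outcomes (1 = model choice matched the target)."""
    return sum(outcomes) / len(outcomes)


# Hypothetical per-triplet outcomes: 70 matches out of 100 triplets.
outcomes = [1] * 70 + [0] * 30
low, high = bootstrap_ci(outcomes, agreement_rate)
print(low, agreement_rate(outcomes), high)
```

Fixing the seed makes the interval reproducible, which helps when comparing steering methods on the same triplet set.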
Taxonomy
Topics: Autonomous Vehicle Technology and Safety · Human-Automation Interaction and Safety
Methods: Focus · ALIGN
