Evaluating Steering Techniques using Human Similarity Judgments

Zach Studdiford; Timothy T. Rogers; Siddharth Suresh; Kushin Mukherjee

arXiv:2505.19333·cs.AI·May 27, 2025


Open Access 3 Reviews

TL;DR

This paper uses a triadic similarity judgment task to evaluate how well different LLM steering techniques align with human cognition. It finds that prompt-based methods perform best and that model representations are biased towards 'kind' similarity over 'size'.

Contribution

It introduces a human cognition-based evaluation method for LLM steering, demonstrating prompt-based steering's effectiveness and revealing biases in model similarity judgments.

Findings

01. Prompt-based steering outperforms other methods in both steering accuracy and model-to-human alignment.

02. LLMs are biased towards 'kind' similarity and struggle to align with 'size' similarity.

03. The evaluation reveals privileged representational axes in LLMs prior to steering.

Abstract

Current evaluations of Large Language Model (LLM) steering techniques focus on task-specific performance, overlooking how well steered representations align with human cognition. Using a well-established triadic similarity judgment task, we assessed steered LLMs on their ability to flexibly judge similarity between concepts based on size or kind. We found that prompt-based steering methods outperformed other methods both in terms of steering accuracy and model-to-human alignment. We also found LLMs were biased towards 'kind' similarity and struggled with 'size' alignment. This evaluation approach, grounded in human cognition, adds further support to the efficacy of prompt-based steering and reveals privileged representational axes in LLMs prior to steering.
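To make the notion of steering accuracy concrete, here is a minimal sketch of how it could be scored on triplets in which one option matches the reference in size and the other matches it in kind. The `Triplet` class, the `steering_accuracy` function, and the example items are hypothetical illustrations, not the authors' code or stimuli, and the paper's own accuracy definition may differ (one review below notes that it appears to be based on a fitted human embedding).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One trial: a reference concept and two mutually exclusive options."""
    reference: str
    size_match: str  # option that resembles the reference in size
    kind_match: str  # option that resembles the reference in kind


def steering_accuracy(triplets, choices, target):
    """Fraction of trials where the chosen option fits the steered dimension.

    `target` is the dimension the model was steered towards: "size" or "kind".
    `choices` holds the option picked on each trial, in the same order.
    """
    if target not in ("size", "kind"):
        raise ValueError("target must be 'size' or 'kind'")
    if len(triplets) != len(choices):
        raise ValueError("need exactly one choice per triplet")
    if not triplets:
        raise ValueError("need at least one triplet")

    hits = 0
    for triplet, choice in zip(triplets, choices):
        expected = triplet.size_match if target == "size" else triplet.kind_match
        hits += int(choice == expected)
    return hits / len(triplets)


# Illustrative items only (assumed, not taken from the paper's dataset).
trials = [
    Triplet(reference="orange", size_match="baseball", kind_match="grape"),
    Triplet(reference="marble", size_match="blueberry", kind_match="bowling ball"),
]
picked = ["baseball", "bowling ball"]
print(steering_accuracy(trials, picked, target="size"))  # 0.5
```

Because the two options are mutually exclusive, a model that ignores the steering signal and always answers by kind would score 0 when steered towards size under this scoring, which is the kind of bias the abstract describes.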

Peer Reviews

Decision·ICLR 2026 Conference Desk Rejected Submission

Reviewer 01 · Rating 6 · Confidence 4

Strengths

+ I like the question of evaluating whether steering interventions produce human-like representations
+ The task design seems carefully constructed, and the implementation details are quite extensive

Weaknesses

- Accuracy appears to be defined against decisions induced by the fitted human embedding rather than some other more reasonable measure like majority human response on the same triplet. Is this circular?
- The experiments are lacking some important details about statistics, specifically the results in Section 4 don't appear to specify the underlying model (is it logistic regression?), confidence intervals, how multiple comparisons are handled, etc.
- Experiment is focused only on two Gemma-2 m

Reviewer 02 · Rating 8 · Confidence 3

Strengths

- The study moves beyond task-specific accuracy to evaluate the representational alignment of steered LLMs with human cognition.
- Application of the triadic similarity judgment task to both humans and LLMs, with a comparison of the results.
- The use of the Round Things Dataset allows for a controlled investigation of how task context selectively emphasizes one dimension over the other.
- Comprehensive comparison of steering methods.
- Discovery of inherent LLM bias towards a specific axis over another via the inclu

Weaknesses

- The study relies exclusively on triadic similarity judgment. This is good for controlled isolation, but the results may not generalize to more complex applications.
- It would be interesting to see how larger models perform.
- Order effects. The authors mention that the experiments were run using only a single overall ordering. I understand the reasoning, but given that LLMs are known to suffer from ordering bias, it would have been better to at least report on results that included a mitigation

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The experimental setup is coherent:
- define triplets with mutually exclusive size vs kind decisions
- collect at least 2,500 judgments per method
- fit 2D embeddings with the crowd-kernel loss, and
- compute squared Procrustes correlations between human-derived and model-derived embeddings.

The instruction formats, in-context learning vs zero-shot, and activation-based interventions are described clearly in the appendix with a consistent steering and evaluation pipeline. The separation of co
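The last step of the pipeline this review describes, a squared Procrustes correlation between two 2D embeddings, can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `procrustes_r2_2d` is hypothetical, translation, uniform scaling, rotation, and reflection are all treated as free (the source does not say whether reflections are allowed), and the closed form used here holds only for two-dimensional embeddings.

```python
import math


def procrustes_r2_2d(X, Y):
    """Squared Procrustes correlation between two lists of (x, y) points.

    Both configurations are centered and scaled to unit Frobenius norm.
    For the 2x2 cross-product matrix M = A^T B with singular values s1, s2,
    the best orthogonal alignment gives r^2 = (s1 + s2)^2, and in 2D
    (s1 + s2)^2 = ||M||_F^2 + 2 * |det M|.
    """
    if len(X) != len(Y) or len(X) < 2:
        raise ValueError("need two equally sized sets of at least 2 points")

    def center_and_scale(points):
        n = len(points)
        mean_x = sum(p[0] for p in points) / n
        mean_y = sum(p[1] for p in points) / n
        centered = [(p[0] - mean_x, p[1] - mean_y) for p in points]
        norm = math.sqrt(sum(a * a + b * b for a, b in centered))
        if norm == 0:
            raise ValueError("all points coincide; alignment is undefined")
        return [(a / norm, b / norm) for a, b in centered]

    A = center_and_scale(X)
    B = center_and_scale(Y)

    # Entries of the 2x2 matrix M = A^T B.
    m00 = sum(a[0] * b[0] for a, b in zip(A, B))
    m01 = sum(a[0] * b[1] for a, b in zip(A, B))
    m10 = sum(a[1] * b[0] for a, b in zip(A, B))
    m11 = sum(a[1] * b[1] for a, b in zip(A, B))

    frobenius_sq = m00 ** 2 + m01 ** 2 + m10 ** 2 + m11 ** 2
    determinant = m00 * m11 - m01 * m10
    return frobenius_sq + 2 * abs(determinant)


# A rotated, scaled, and shifted copy of a configuration aligns perfectly.
human = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
theta = 0.7
model = [
    (2.5 * (x * math.cos(theta) - y * math.sin(theta)) + 4.0,
     2.5 * (x * math.sin(theta) + y * math.cos(theta)) - 1.0)
    for x, y in human
]
print(round(procrustes_r2_2d(human, model), 6))  # 1.0
```

The value lies between 0 and 1, with 1 meaning the two embeddings are identical up to the allowed transformations.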

Weaknesses

- The decision to fix the embedding dimensionality at two could artificially compress structure and differentially affect methods; it would be helpful to report results across multiple dimensionalities with model selection via held-out triplets or information criteria, or to show that conclusions are stable at d=3–5.
- Procrustes r^2 should be accompanied by uncertainty estimates, for example via bootstrapping triplets, and significance assessed against a permutation baseline that preserves trip

Taxonomy

Topics: Autonomous Vehicle Technology and Safety · Human-Automation Interaction and Safety

Methods: Focus · ALIGN