TL;DR
This paper investigates the intrinsic bias that causes LLM agents to over-call tools, identifies its mechanistic basis, and proposes a causal correction that mitigates the bias and improves overall accuracy.
Contribution
It introduces a mechanistic account of over-calling bias in LLM agents and develops a causal correction technique based on Sparse Autoencoder (SAE) feature analysis.
Findings
Over-calling bias is linked to an activation-independent call offset in the call/no-call decision mapping.
Using SAE-based features, this offset can be estimated and countered.
Applying the correction improves overall accuracy with minimal impact on call accuracy.
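To make the idea of an activation-independent offset concrete, here is a minimal sketch on synthetic data. It is not the paper's estimator: it assumes the decision can be modeled as a logistic function of a signed margin (call activation minus no_call activation), and the names fit_call_offset and neutral_margin, the synthetic parameters, and the gradient-descent fit are all illustrative assumptions. Under that model, a positive intercept means the model still favors call at a margin of zero, and the decision-neutral point sits at a negative margin.

```python
import numpy as np

def fit_call_offset(margin, called, lr=0.1, steps=5000):
    """Fit P(call) = sigmoid(w * margin + b) by plain gradient descent.

    margin: signed activation margin (call minus no_call), assumed given.
    called: 1 if the model called a tool, else 0.
    Returns (w, b); b plays the role of an activation-independent offset.
    """
    margin = np.asarray(margin, dtype=float)
    called = np.asarray(called, dtype=float)
    w, b = 1.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * margin + b)))
        err = p - called                      # gradient of the log loss w.r.t. the logit
        w -= lr * np.mean(err * margin)
        b -= lr * np.mean(err)
    return w, b

def neutral_margin(w, b):
    """Margin at which P(call) = 0.5 under the fitted model."""
    return -b / w

# Synthetic data with a built-in call offset (hypothetical values, not from the paper).
rng = np.random.default_rng(0)
margin = rng.normal(0.0, 2.0, size=4000)
true_w, true_b = 1.5, 1.0
p_call = 1.0 / (1.0 + np.exp(-(true_w * margin + true_b)))
called = rng.random(4000) < p_call

w, b = fit_call_offset(margin, called)
m_star = neutral_margin(w, b)
print(f"fitted slope={w:.2f}, offset={b:.2f}, neutral margin={m_star:.2f}")
```

In this toy setting the fitted offset is positive and the neutral margin is negative, which mirrors the qualitative pattern reported in the Findings: the decision is balanced only when no_call evidence outweighs call evidence.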
Abstract
LLM agents exhibit a consistent tendency to over-call, invoking tools even in situations where none is needed. On the When2Call benchmark, six models from three families show high call accuracy but much lower no-call accuracy, leaving overall accuracy in the 55%-70% range. We trace this to an Intrinsic Bias Hypothesis (IBH): the call/no-call decision mapping carries an activation-independent call offset, so the model favors call even at activation parity. Using Sparse Autoencoders (SAEs), we recover behavior-aligned feature bases for the call/no_call decision, reduce them to a signed activation margin, and estimate the offset directly. Across all six models, the model is decision-neutral only when no_call activation outweighs call activation, consistent with IBH. We then causally test IBH with Adaptive Margin-Calibrated Steering (AMCS), a closed-form counter-bias shift along SAE decoder…
