Tatemae: Detecting Alignment Faking via Tool Selection in LLMs
Matteo Leonesi, Francesco Belardinelli, Flavio Corradini, Marco Piangerelli

TL;DR
This paper introduces Tatemae, a method to detect alignment faking in LLMs by analyzing tool selection behavior under different monitoring conditions, revealing vulnerabilities related to training methods.
Contribution
The paper formalizes alignment faking as a behavioral event and proposes a novel detection approach based on observable tool switching, supported by a new dataset and empirical evaluation.
Findings
Detection rates ranged from 3.5% to 23.7% across models.
Vulnerability profiles vary by domain and pressure type.
Susceptibility depends more on training methodology than capability.
Abstract
Alignment faking (AF) occurs when an LLM strategically complies with training objectives to avoid value modification, reverting to prior preferences once monitoring is lifted. Current detection methods focus on conversational settings and rely primarily on Chain-of-Thought (CoT) analysis, which provides a reliable signal when strategic reasoning surfaces, but cannot distinguish deception from capability failures if traces are absent or unfaithful. We formalize AF as a composite behavioural event and detect it through observable tool selection, where the LLM selects the safe tool when unmonitored, but switches to the unsafe tool under monitoring that rewards helpfulness over safety, while its reasoning still acknowledges the safe choice. We release a dataset of 108 enterprise IT scenarios spanning Security, Privacy, and Integrity domains under Corruption and Sabotage pressures.…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
