Tatemae: Detecting Alignment Faking via Tool Selection in LLMs

Matteo Leonesi; Francesco Belardinelli; Flavio Corradini; Marco Piangerelli

arXiv:2604.26511·cs.CR·April 30, 2026

Tatemae: Detecting Alignment Faking via Tool Selection in LLMs

Matteo Leonesi, Francesco Belardinelli, Flavio Corradini, Marco Piangerelli

PDF

TL;DR

This paper introduces Tatemae, a method to detect alignment faking in LLMs by analyzing tool selection behavior under different monitoring conditions, revealing vulnerabilities related to training methods.

Contribution

The paper formalizes alignment faking as a behavioral event and proposes a novel detection approach based on observable tool switching, supported by a new dataset and empirical evaluation.

Findings

01

Detection rates ranged from 3.5% to 23.7% across models.

02

Vulnerability profiles vary by domain and pressure type.

03

Susceptibility depends more on training methodology than capability.

Abstract

Alignment faking (AF) occurs when an LLM strategically complies with training objectives to avoid value modification, reverting to prior preferences once monitoring is lifted. Current detection methods focus on conversational settings and rely primarily on Chain-of-Thought (CoT) analysis, which provides a reliable signal when strategic reasoning surfaces, but cannot distinguish deception from capability failures if traces are absent or unfaithful. We formalize AF as a composite behavioural event and detect it through observable tool selection, where the LLM selects the safe tool when unmonitored, but switches to the unsafe tool under monitoring that rewards helpfulness over safety, while its reasoning still acknowledges the safe choice. We release a dataset of 108 enterprise IT scenarios spanning Security, Privacy, and Integrity domains under Corruption and Sabotage pressures.…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.