Narrow Secret Loyalty Dodges Black-Box Audits
Alfie Lamerton, Fabien Roger

TL;DR
This paper introduces models with narrow secret loyalties, which covertly promote the interests of a specific principal, evaluates how well black-box auditing techniques detect them, and analyzes how effectively dataset monitoring identifies the poisoned training data.
Contribution
It presents the first model organisms of narrow secret loyalties, evaluates their detectability under black-box auditing, and analyzes poisoning strategies and the effectiveness of dataset monitoring.
Findings
Detection improves when auditors know the principal but remains generally low.
Poisoned training data can be identified even at low poison fractions.
The attack persists across different poison fractions, with dataset-monitoring precision degrading.
Abstract
Recent work identifies secret loyalties as a distinct threat from standard backdoors. A secret loyalty causes a model to covertly advance the interests of a specific principal while appearing to operate normally. We construct the first model organisms of narrow secret loyalties. We fine-tune Qwen-2.5-Instruct at three scales (1.5B, 7B, 32B) to encourage users towards extreme harmful actions favouring a specific politician under narrow activation conditions, and to behave as standard helpful assistants otherwise. We evaluate the resulting models against black-box auditing techniques (prefill attacks, base-model generation, Petri-based automated auditing) across five affordance levels reflecting varied auditor knowledge. Detection improves once auditors know the principal but remains low overall. Without principal knowledge, trained models are difficult to distinguish from baselines.…
