Unknown Unknowns: Why Hidden Intentions in LLMs Evade Detection

Devansh Srivastav; David Pape; Lea Schönherr

arXiv:2601.18552 · cs.CL · January 27, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a taxonomy of hidden intentions in large language models, demonstrates their inducibility, evaluates detection challenges in real-world scenarios, and emphasizes the need for robust governance frameworks.

Contribution

It provides the first systematic taxonomy of hidden intentions, assesses detection failures in open-world settings, and offers insights for improving model oversight and safety.

Findings

01 · Detection methods fail in realistic open-world scenarios

02 · Hidden intentions manifest in state-of-the-art LLMs

03 · Detection accuracy drops under low-prevalence conditions
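
The low-prevalence finding is an instance of the familiar base-rate effect, which a short calculation can illustrate. The sketch below is not the paper's evaluation code: the function name `precision_at_prevalence` and the detector rates (90% true-positive rate, 5% false-positive rate) are hypothetical values chosen only to show how precision changes as hidden-intention cases become rarer.

```python
def precision_at_prevalence(prevalence: float, tpr: float, fpr: float) -> float:
    """Precision (positive predictive value) of a detector, via Bayes' rule.

    prevalence: fraction of interactions that truly contain the behaviour
    tpr: true-positive rate (recall) of the detector
    fpr: false-positive rate of the detector
    """
    true_pos = tpr * prevalence            # flagged and actually positive
    false_pos = fpr * (1.0 - prevalence)   # flagged but actually benign
    flagged = true_pos + false_pos
    return true_pos / flagged if flagged > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical detector rates, assumed for illustration only.
    for prevalence in (0.5, 0.1, 0.01, 0.001):
        p = precision_at_prevalence(prevalence, tpr=0.9, fpr=0.05)
        print(f"prevalence={prevalence:<6} precision={p:.3f}")
```

Under these assumed rates, the same detector that looks strong on a balanced benchmark produces mostly false alarms once positives are rare, even though its recall and false-positive rate are unchanged.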

Abstract

LLMs are increasingly embedded in everyday decision-making, yet their outputs can encode subtle, unintended behaviours that shape user beliefs and actions. We refer to these covert, goal-directed behaviours as hidden intentions, which may arise from training and optimisation artefacts, or be deliberately induced by an adversarial developer, yet remain difficult to detect in practice. We introduce a taxonomy of ten categories of hidden intentions, grounded in social science research and organised by intent, mechanism, context, and impact, shifting attention from surface-level behaviours to design-level strategies of influence. We show how hidden intentions can be easily induced in controlled models, providing both testbeds for evaluation and demonstrations of potential misuse. We systematically assess detection methods, including reasoning and non-reasoning LLM judges, and find that…

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 10 · Confidence 3

Strengths

The paper introduces a very interesting and important concern for the future use of LLMs in a way that is accessible not only to the scientific community but to the general public as well. The examples given throughout the paper are good and thought-provoking. The design of the experiments is sound: appropriate metrics are used to evaluate the results, and an extensive set of models is evaluated. The appendices include more detailed results for readers who want a more fine-grained overview.

Weaknesses

The plots in Figures 2, 3 and 7 are hard to discern, even for readers who are not color blind, because colors are repeated and too similar. Using different line styles or geometric markers along the lines would greatly improve the readability of the plots.

Reviewer 02 · Rating 2 · Confidence 3

Strengths

- Important topic: devising methods to improve and control the safety of LLM answers is crucial.
- Difficult approach: constructing a taxonomy of undesirable behaviors is a difficult task, always prone to debate.
- Stressing that LLMs are poor judges of LLMs is important (but arguably new, see below).
- I appreciated the authors' reflexivity on the limits of their approach, and the idea of exhibiting the limits of detection in best-case scenarios to argue for a larger risk in the wild.

Weaknesses

- In my opinion, there is a strong mismatch between the high-level goals of the paper (answering "why hidden intentions evade detection") and its actual contents (analyzing precision/recall on a benchmark of ok/not-ok interactions). In other words, the paper showcases some intentions evading some detection, but does not explain why.
- Similarly, I am disturbed by the notion of a "functional but not anthropomorphic" use of "hidden intentions". Intention without agency sounds like "dehydrated water". W

Reviewer 03 · Rating 2 · Confidence 3

Strengths

1. The introduction is well written and provides a strong motivation for this research direction. In particular, the idea of decoupling *intentions* from *surface-level tendencies* is interesting to explore: the same behavior (e.g. agreeing with the user) could be benign or undesirable, depending on the context.
2. The topic is important: as models become increasingly capable and coherent, identifying goal-directed behavioral patterns will be an important topic to understand.
3. The qualitativ

Weaknesses

1. The paper claims to introduce a "taxonomy" of hidden intentions. But typically a taxonomy refers to the categorization of some existing set of things in the world - what exactly is being taxonomized here, and how was the taxonomy generated? If this taxonomy of 10 intentions is intended to be a contribution, what makes *these specific 10* intentions interesting to study?
1a. The paper says "Building on existing literature and conceptual analysis, we propose ten broad categories of hidden inte

Taxonomy

Topics: Ethics and Social Impacts of AI · Privacy, Security, and Data Protection · Explainable Artificial Intelligence (XAI)