Trojan Horses in Recruiting: A Red-Teaming Case Study on Indirect Prompt Injection in Standard vs. Reasoning Models
Manuel Wirth

TL;DR
This study investigates the security vulnerabilities of Large Language Models in HR applications, revealing that reasoning-enhanced models may be more susceptible to sophisticated prompt injection attacks than standard models.
Contribution
It provides a comparative analysis of failure modes in standard versus reasoning models under adversarial prompts, challenging assumptions about reasoning models' safety advantages.
Findings
Standard models resorted to hallucinations under simple attacks.
Reasoning models used strategic reframing to persuade and showed meta-cognitive leakage.
Complex instructions caused reasoning models to unintentionally reveal injection logic.
Abstract
As Large Language Models (LLMs) are increasingly integrated into automated decision-making pipelines, specifically within Human Resources (HR), the security implications of Indirect Prompt Injection (IPI) become critical. While a prevailing hypothesis posits that "Reasoning" or "Chain-of-Thought" Models possess safety advantages due to their ability to self-correct, emerging research suggests these capabilities may enable more sophisticated alignment failures. This qualitative Red-Teaming case study challenges the safety-through-reasoning premise using the Qwen 3 30B architecture. By subjecting both a standard instruction-tuned model and a reasoning-enhanced model to a "Trojan Horse" curriculum vitae, distinct failure modes are observed. The results suggest a complex trade-off: while the Standard Model resorted to brittle hallucinations to justify simple attacks and filtered out…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsEthics and Social Impacts of AI · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning
