TL;DR
This paper demonstrates that large language models are vulnerable to prompt injection attacks even on simple multiple-choice questions, raising concerns about their robustness in critical applications.
Contribution
The study introduces a straightforward prompt injection attack method on LLMs using basic arithmetic questions in PDFs, revealing significant robustness vulnerabilities.
Findings
LLMs can be misled by hidden prompts in simple questions
Prompt injection attacks are effective even in trivial scenarios
Robustness issues pose risks for LLMs in judgment tasks
Abstract
Large Language Models (LLMs) have recently demonstrated strong emergent abilities in complex reasoning and zero-shot generalization, showing unprecedented potential for LLM-as-a-judge applications in education, peer review, and data quality evaluation. However, their robustness under prompt injection attacks, where malicious instructions are embedded into the content to manipulate outputs, remains a significant concern. In this work, we explore a frustratingly simple yet effective attack setting to test whether LLMs can be easily misled. Specifically, we evaluate LLMs on basic arithmetic questions (e.g., "What is 3 + 2?") presented as either multiple-choice or true-false judgment problems within PDF files, where hidden prompts are injected into the file. Our results reveal that LLMs are indeed vulnerable to such hidden prompt injection attacks, even in these trivial scenarios,…
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
**1) Clear, reproducible setup:** The authors define a simple, interpretable pipeline (Eq. 1) for prompt injection in PDFs using LaTeX color control. **2) Novel variant of a known problem** Prior works (e.g., Liu et al., 2024c; Guo et al., 2024; Raina et al., 2024) study prompt injection generally, but few explore the PDF-hidden variant. The work highlights this under-studied vector. **3) Empirical value.** Confirms that even simple visual-level attacks can bypass superficial safety filters in
**1) Trivial methodology, no insight beyond anecdote:** This paper's contribution is essentially a reproduction with simpler math tasks. The authors never quantify why certain models succumb, i.e., there is no causal insight, just observed failure. **2) Excessive space on prompt instantiation, minimal analysis:** Over two pages are devoted to LaTeX examples of "black", "white", and "no" prompt injections. This is implementation detail; the space could instead show token-level model traces or wh
The paper is overall well written and straightforward to read. The authors also aim to contribute to an important area, LLM security.
- **This work seems to lack novelty.** The authors overlook a key related work “Invisible Prompts, Visible Threats: Malicious Font Injection in External Resources for Large Language Models” published in EMNLP Findings 2025 which investigates very similar “font injection” attacks. The setting they investigate is more general and their experimental investigation is more extensive. - **The experimental investigation is very limited.** It seems like the authors only do an experimental evaluation on
The paper discusses the important issue of fairness and robustness of LLMs when used as evaluators. The authors' initial exploration is promising. By successfully applying different prompt injection attacks on powerful LLMs, the paper effectively demonstrates the existence of these vulnerabilities.
Despite the promising direction, the paper suffers from several major weaknesses in its current form: 1. The main conclusion of the paper appears to be trivial and somewhat obvious. The vulnerability of LLMs to injection attacks has already been well-established in a large body of prior work. This paper's analysis, while confirmatory, does little more than reiterate this known phenomenon. Consequently, the primary research question 1 posed by the authors is not a true research question, as its
1. The paper studies the reliability of LLM-as-a-judge applications under prompt injection, which is a timely area as AI integration becomes more common. 2. The paper is generally well presented and easy to understand.
1. The core discovery is LLMs are "fooled" with injected prompts. However, essentially it's an instruction-following model following instructions (injected prompt) found in the input text. The finding is expected and not surprising. The experiments, for example black-text and white-text prompts in pdf, do not provide much insight. The scientific or technical contribution is quite limited. 2. Table 1 contains objectively wrong parameter counts for GPT-4o, o3, and DeepSeek. 3. The authors mentione
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
