POT: Inducing Overthinking in LLMs via Black-Box Iterative Optimization
Xinyu Li, Tianjin Huang, Ronghui Mu, Xiaowei Huang, Gaojie Jin

TL;DR
This paper introduces POT, a black-box iterative optimization method that generates covert adversarial prompts to induce overthinking in large language models. It exposes vulnerabilities without requiring external data or access to model internals.
Contribution
POT is a novel black-box attack framework that uses iterative optimization to create natural-sounding adversarial prompts, overcoming the limitations of previous overthinking attacks.
Findings
POT outperforms existing methods across various models and datasets.
It generates covert, semantically natural adversarial prompts.
The approach does not require external knowledge sources or retrievable poisoned content.
Abstract
Recent advances in Chain-of-Thought (CoT) prompting have substantially enhanced the reasoning capabilities of large language models (LLMs), enabling sophisticated problem-solving through explicit multi-step reasoning traces. However, these enhanced reasoning processes introduce novel attack surfaces, particularly vulnerabilities to computational inefficiency through unnecessarily verbose reasoning chains that consume excessive resources without corresponding performance gains. Prior overthinking attacks typically require restrictive conditions including access to external knowledge sources for data poisoning, reliance on retrievable poisoned content, and structurally obvious templates that limit practical applicability in real-world scenarios. To address these limitations, we propose POT (Prompt-Only OverThinking), a novel black-box attack framework that employs LLM-based iterative…
