Paraphrasing Adversarial Attack on LLM-as-a-Reviewer
Masahiro Kaneko

TL;DR
This paper introduces PAA, a black-box paraphrasing attack on LLM-based peer review systems. PAA raises review scores without changing the paper's claims, exposing a vulnerability in these systems and a potential detection signal.
Contribution
The paper presents PAA, a novel paraphrasing attack method that exploits LLM review systems, demonstrating increased scores and proposing mitigation strategies.
Findings
PAA effectively raises review scores across multiple models and conferences.
Human evaluators confirm that the paraphrases preserve meaning and read naturally.
Attacked papers show increased review perplexity, aiding detection.
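The perplexity finding can be made concrete with a short sketch. It uses the standard definition of perplexity (the exponential of the negative mean token log-probability) and assumes the per-token log-probabilities of a generated review are available from some language model. The summary does not say which model or threshold the paper uses, so `flag_suspicious` and its `threshold` argument are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Standard perplexity: exp of the negative mean natural-log probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def flag_suspicious(review_logprobs: Sequence[float], threshold: float) -> bool:
    """Hypothetical detector: flag a review whose perplexity exceeds a
    threshold calibrated on reviews of papers known to be unattacked."""
    return perplexity(review_logprobs) > threshold


# Toy input: four tokens, each assigned probability 0.5 by the model.
uniform_half = [math.log(0.5)] * 4
print(perplexity(uniform_half))
```

In practice the threshold would have to be calibrated per reviewer model, since baseline perplexity differs between models.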
Abstract
The use of large language models (LLMs) in peer review systems has attracted growing attention, making it essential to examine their potential vulnerabilities. Prior attacks rely on prompt injection, which alters manuscript content and conflates injection susceptibility with evaluation robustness. We propose the Paraphrasing Adversarial Attack (PAA), a black-box optimization method that searches for paraphrased sequences yielding higher review scores while preserving semantic equivalence and linguistic naturalness. PAA leverages in-context learning, using previous paraphrases and their scores to guide candidate generation. Experiments across five ML and NLP conferences with three LLM reviewers and five attacking models show that PAA consistently increases review scores without changing the paper's claims. Human evaluation confirms that generated paraphrases maintain meaning and…
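The search described in the abstract can be sketched as a simple loop: propose paraphrases, score them with the reviewer, and feed the best past attempts back to the attacker as in-context examples. This is a minimal sketch under stated assumptions, not the authors' implementation. The names `paa_search`, `propose`, `score`, and `is_valid` are hypothetical, and the toy functions below stand in for the attacking LLM, the LLM reviewer, and the semantic-equivalence and naturalness checks.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]  # (paraphrase, review score) pairs


def paa_search(
    text: str,
    propose: Callable[[str, History], List[str]],  # stand-in for the attacking LLM
    score: Callable[[str], float],                 # stand-in for the black-box reviewer
    is_valid: Callable[[str, str], bool],          # stand-in for meaning/naturalness checks
    iterations: int = 5,
    history_size: int = 4,
) -> Tuple[str, float]:
    best_text, best_score = text, score(text)
    history: History = [(best_text, best_score)]
    for _ in range(iterations):
        # In-context examples: the highest-scoring attempts seen so far.
        examples = sorted(history, key=lambda h: h[1], reverse=True)[:history_size]
        for candidate in propose(text, examples):
            if not is_valid(text, candidate):
                continue  # discard paraphrases that fail the constraint check
            s = score(candidate)
            history.append((candidate, s))
            if s > best_score:
                best_text, best_score = candidate, s
    return best_text, best_score


# Toy stand-ins so the sketch runs without any model (all assumptions).
SYNONYMS = {"good": "strong", "shows": "demonstrates", "new": "novel"}


def toy_propose(original: str, examples: History) -> List[str]:
    words = examples[0][0].split()  # build on the best attempt so far
    return [
        " ".join(new if w == old else w for w in words)
        for old, new in SYNONYMS.items()
        if old in words
    ]


def toy_score(text: str) -> float:
    return float(sum(w in SYNONYMS.values() for w in text.split()))


def toy_valid(original: str, candidate: str) -> bool:
    return len(original.split()) == len(candidate.split())


best_text, best_score = paa_search(
    "our new method shows good results", toy_propose, toy_score, toy_valid
)
print(best_text, best_score)
```

The loop only ever calls `score` on candidate text, which is what makes the attack black-box: it needs the reviewer's output scores, not its weights or gradients.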
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · Adversarial Robustness in Machine Learning
