Paraphrasing Adversarial Attack on LLM-as-a-Reviewer

Masahiro Kaneko

arXiv:2601.06884·cs.CL·January 13, 2026

Open Access

TL;DR

This paper introduces PAA, a black-box paraphrasing attack on LLM-based peer review systems. PAA raises review scores by rewording a paper without changing its claims, which exposes a vulnerability in LLM reviewers and also points to potential detection signals.

Contribution

The paper presents PAA, a novel paraphrasing attack method that exploits LLM review systems, demonstrating increased scores and proposing mitigation strategies.

Findings

01. PAA consistently raises review scores across three LLM reviewers and five ML and NLP conferences.

02. Human evaluation confirms that the paraphrases preserve meaning and naturalness.

03. Reviews of attacked papers show increased perplexity, a potential signal for detecting the attack.
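
As a rough illustration of the third finding, the sketch below computes perplexity from per-token log-probabilities and flags a review whose perplexity exceeds a cutoff. This page does not say which model scores the reviews or how a cutoff would be chosen, so `flag_review` and its `threshold` parameter are hypothetical, and the log-probabilities are assumed to come from some language model that scores the review text.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def flag_review(token_logprobs: Sequence[float], threshold: float) -> bool:
    """Hypothetical detector: flag a review whose perplexity exceeds `threshold`.

    The threshold is an assumption for illustration; in practice it would have
    to be calibrated on reviews of papers known not to be attacked.
    """
    return perplexity(token_logprobs) > threshold


# Toy inputs: every token has probability 0.5 (fluent) or 0.1 (surprising).
fluent = [math.log(0.5)] * 4
surprising = [math.log(0.1)] * 4
print(perplexity(fluent), flag_review(fluent, threshold=5.0))
print(perplexity(surprising), flag_review(surprising, threshold=5.0))
```

A single fixed cutoff is the simplest possible rule; comparing against the distribution of perplexities for ordinary reviews would likely be more reliable.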

Abstract

The use of large language models (LLMs) in peer review systems has attracted growing attention, making it essential to examine their potential vulnerabilities. Prior attacks rely on prompt injection, which alters manuscript content and conflates injection susceptibility with evaluation robustness. We propose the Paraphrasing Adversarial Attack (PAA), a black-box optimization method that searches for paraphrased sequences yielding higher review scores while preserving semantic equivalence and linguistic naturalness. PAA leverages in-context learning, using previous paraphrases and their scores to guide candidate generation. Experiments across five ML and NLP conferences with three LLM reviewers and five attacking models show that PAA consistently increases review scores without changing the paper's claims. Human evaluation confirms that generated paraphrases maintain meaning and…
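
The search loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `paa_search`, `build_prompt`, the prompt wording, the round and candidate counts, and the top-k choice of in-context examples are all hypothetical. The attacking model, the LLM reviewer, and the meaning and naturalness check are passed in as plain callables and replaced here by toy stand-ins so the example runs without any API.

```python
import random
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]  # (paraphrase, review score) pairs


def build_prompt(original: str, history: History, k: int = 3) -> str:
    """Build an in-context prompt from the k best-scoring paraphrases so far."""
    best = sorted(history, key=lambda item: item[1], reverse=True)[:k]
    examples = "\n".join(f"Score {score:.1f}: {text}" for text, score in best)
    return (
        "Paraphrase the passage so its meaning is unchanged and it reads naturally.\n"
        f"Earlier paraphrases and their review scores:\n{examples}\n"
        f"Passage: {original}\nParaphrase:"
    )


def paa_search(
    original: str,
    paraphrase: Callable[[str], str],     # attacking model: prompt -> candidate
    review: Callable[[str], float],       # black-box reviewer: text -> score
    is_valid: Callable[[str, str], bool], # meaning/naturalness check (assumed)
    rounds: int = 5,
    candidates_per_round: int = 4,
) -> Tuple[str, float]:
    """Return the highest-scoring valid paraphrase found, with its score."""
    history: History = [(original, review(original))]
    for _ in range(rounds):
        prompt = build_prompt(original, history)
        for _ in range(candidates_per_round):
            candidate = paraphrase(prompt)
            if is_valid(original, candidate):
                history.append((candidate, review(candidate)))
    return max(history, key=lambda item: item[1])


# Toy stand-ins, purely illustrative.
VARIANTS = [
    "Our method improves accuracy.",
    "Our method consistently improves accuracy.",
    "Our approach yields robust and consistently better accuracy.",
]


def toy_review(text: str) -> float:
    # Pretend reviewer that rewards particular wording.
    return 5.0 + sum(text.lower().count(w) for w in ("robust", "consistently"))


def make_toy_paraphraser(seed: int = 0) -> Callable[[str], str]:
    rng = random.Random(seed)
    return lambda prompt: rng.choice(VARIANTS)  # ignores the prompt


def toy_is_valid(original: str, candidate: str) -> bool:
    return candidate in VARIANTS  # placeholder for a real equivalence check


best_text, best_score = paa_search(
    VARIANTS[0], make_toy_paraphraser(0), toy_review, toy_is_valid
)
print(best_score, best_text)
```

Because the reviewer is only ever called as `review(text)`, the loop needs no gradients or model internals, which is what makes the setting black-box. Keeping the original in the history guarantees the returned score is never lower than the starting score.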

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · Adversarial Robustness in Machine Learning