SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts

Yuan Xin; Yixuan Weng; Minjun Zhu; Ying Ling; Chengwei Qin; Michael Hahn; Michael Backes; Yue Zhang; Linyi Yang

arXiv:2604.26506·cs.CL·April 30, 2026

TL;DR

This paper introduces SafeReview, a framework that jointly trains a Generator of attack prompts and a Defender that detects them, with the goal of making LLM-based peer review systems more robust to malicious prompts hidden in submissions.

Contribution

It presents a novel co-evolutionary adversarial training approach to detect and defend against sophisticated prompt-based attacks in peer review.

Findings

01 Enhanced resilience of the defender model against evolving adversarial prompts

02 Dynamic co-evolution leads to more robust detection capabilities

03 Framework outperforms static defense methods in robustness tests

Abstract

As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to adversarial prompts (hidden instructions embedded in submissions to manipulate outcomes) emerges as a critical threat to scholarly integrity. To counter this, we propose a novel adversarial framework in which a Generator model, trained to create sophisticated attack prompts, is jointly optimized with a Defender model tasked with detecting them. The system is trained using a loss function inspired by Information Retrieval Generative Adversarial Networks (IRGAN), which fosters a dynamic co-evolution between the two models and forces the Defender to develop robust capabilities against continuously improving attack strategies. The resulting framework demonstrates significantly enhanced resilience to novel and evolving threats compared to static defenses, thereby establishing a…
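The abstract does not state the loss function or the model architectures, so the following is only a toy sketch of the general idea of Generator/Defender co-training, not the paper's method. It assumes a logistic-regression Defender, a Generator that is merely a softmax policy over three fixed attack templates, and a REINFORCE-style update in which the Generator is rewarded for evading detection (the kind of policy-gradient update IRGAN-style training uses for discrete outputs). The feature vectors, the names `detect_prob`, `defender_step`, `generator_step`, and `co_train`, and all hyperparameters are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy features per text: [instruction-likeness, visual concealment].
BENIGN = [[0.0, 0.0], [0.2, 0.0], [0.1, 0.1]]
ATTACKS = [[1.0, 1.0], [0.9, 0.2], [0.4, 0.9]]  # fixed attack templates (assumed)

def detect_prob(w, b, x):
    """Defender's estimated probability that text x carries a hidden attack."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def defender_step(w, b, benign, attack, lr):
    """One gradient-ascent step on log D(attack) + log(1 - D(benign))."""
    g_att = 1.0 - detect_prob(w, b, attack)  # derivative of log sigmoid(z)
    g_ben = -detect_prob(w, b, benign)       # derivative of log(1 - sigmoid(z))
    new_w = [wi + lr * (g_att * a + g_ben * c)
             for wi, a, c in zip(w, attack, benign)]
    return new_w, b + lr * (g_att + g_ben)

def generator_step(logits, choice, reward, baseline, lr):
    """One REINFORCE step: favor `choice` when its reward beats the baseline."""
    probs = softmax(logits)
    advantage = reward - baseline
    return [l + lr * advantage * ((1.0 if j == choice else 0.0) - p)
            for j, (l, p) in enumerate(zip(logits, probs))]

def co_train(steps=2000, lr_d=0.1, lr_g=0.05, seed=0):
    """Alternate Generator and Defender updates; return (w, b, attack policy)."""
    rng = random.Random(seed)
    w, b = [0.0, 0.0], 0.0
    logits = [0.0] * len(ATTACKS)
    baseline = 0.5  # running average of the reward, used to reduce variance
    for _ in range(steps):
        choice = rng.choices(range(len(ATTACKS)), weights=softmax(logits))[0]
        attack, benign = ATTACKS[choice], rng.choice(BENIGN)
        reward = 1.0 - detect_prob(w, b, attack)  # paid for evading detection
        logits = generator_step(logits, choice, reward, baseline, lr_g)
        baseline = 0.9 * baseline + 0.1 * reward
        w, b = defender_step(w, b, benign, attack, lr_d)
    return w, b, softmax(logits)

if __name__ == "__main__":
    w, b, policy = co_train()
    print("attack policy:", [round(p, 3) for p in policy])
    print("D(attack):", [round(detect_prob(w, b, x), 3) for x in ATTACKS])
    print("D(benign):", [round(detect_prob(w, b, x), 3) for x in BENIGN])
```

In this sketch the Generator shifts probability toward whichever template the Defender currently detects least well, and the Defender is then trained mostly on those harder cases. That loop is the sense in which the two models co-evolve; a static defense would correspond to training the Defender once against a fixed attack distribution.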
