SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts
Yuan Xin, Yixuan Weng, Minjun Zhu, Ying Ling, Chengwei Qin, Michael Hahn, Michael Backes, Yue Zhang, Linyi Yang

TL;DR
This paper introduces SafeReview, a framework that adversarially trains a Generator of attack prompts against a Defender that detects them, making LLM-based peer review systems more robust to malicious prompts.
Contribution
It presents a novel co-evolutionary adversarial training approach to detect and defend against sophisticated prompt-based attacks in peer review.
Findings
The Defender model becomes more resilient to evolving adversarial prompts.
Dynamic co-evolution of the Generator and Defender leads to more robust detection capabilities.
The framework outperforms static defense methods in robustness tests.
Abstract
As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to adversarial prompts -- adversarial instructions embedded in submissions to manipulate outcomes -- emerges as a critical threat to scholarly integrity. To counter this, we propose a novel adversarial framework where a Generator model, trained to create sophisticated attack prompts, is jointly optimized with a Defender model tasked with their detection. This system is trained using a loss function inspired by Information Retrieval Generative Adversarial Networks, which fosters a dynamic co-evolution between the two models, forcing the Defender to develop robust capabilities against continuously improving attack strategies. The resulting framework demonstrates significantly enhanced resilience to novel and evolving threats compared to static defenses, thereby establishing a…
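The abstract describes the training signal only as "inspired by Information Retrieval Generative Adversarial Networks" (IRGAN). As a rough illustration, the sketch below computes the two quantities of a generic IRGAN-style minimax game on toy numbers: the value the discriminating model maximizes, and the per-sample reward used for a policy-gradient (REINFORCE) update of the generating model. The function names, the toy logits, and the mapping of "reference" and "generated" samples onto SafeReview's Defender and Generator are assumptions made for illustration; the abstract does not give the paper's actual loss.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def defender_value(reference_logits, generated_logits):
    """GAN-style value maximized by the discriminating model (hypothetical helper).

    V = mean(log D(reference)) + mean(log(1 - D(generated))), with D = sigmoid(logit).
    Uses log sigmoid(x) = -softplus(-x) and log(1 - sigmoid(x)) = -softplus(x).
    """
    ref_term = sum(-softplus(-s) for s in reference_logits) / len(reference_logits)
    gen_term = sum(-softplus(s) for s in generated_logits) / len(generated_logits)
    return ref_term + gen_term


def generator_rewards(generated_logits):
    """IRGAN-style REINFORCE reward for each sampled output (hypothetical helper).

    Minimizing log(1 - D(x)) is the same as maximizing log(1 + exp(logit)),
    so a sample the discriminating model scores highly earns a larger reward.
    """
    return [softplus(s) for s in generated_logits]


# Toy logits (made up): scores the discriminating model assigns to each sample.
reference = [2.0, 1.5]
generated = [-1.0, 0.5]

value = defender_value(reference, generated)
rewards = generator_rewards(generated)
```

In a full training loop each reward would weight the gradient of the log-probability of the corresponding sampled prompt, while the discriminating model takes gradient steps that increase `value`; alternating the two updates is what produces the co-evolution the abstract refers to.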
