Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers

Liang Lin; Zhihao Xu; Xuehai Tang; Shi Liu; Biyu Zhou; Fuqing Zhu; Jizhong Han; Songlin Hu

arXiv:2507.13474 · cs.CL · July 21, 2025

Open Access

TL;DR

This paper introduces Paper Summary Attack (PSA), a novel method exploiting LLMs' trust in authoritative sources by synthesizing adversarial prompts from safety papers, revealing significant vulnerabilities in current models.

Contribution

The paper presents PSA, a new attack technique that leverages safety papers to systematically generate adversarial prompts, exposing vulnerabilities in both base and aligned LLMs.

Findings

01  High attack success rates of 97-98% on various models.

02  Vulnerability varies across models and across model versions, indicating model-specific bias.

03  PSA exposes significant security risks in current LLM safety measures.
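The success rates quoted in the findings are a simple proportion. As a minimal sketch, assuming each attempt has already been labeled successful or not by some external judge (the paper's actual judging procedure is not described here, and the function name and labels below are hypothetical), the metric can be computed as:

```python
from typing import Sequence


def attack_success_rate(judgments: Sequence[bool]) -> float:
    """Return the fraction of attempts that a judge labeled as successful.

    `judgments` is a hypothetical input: one boolean per attempt, produced
    by whatever evaluation procedure is in use. It is not taken from the paper.
    """
    if not judgments:
        raise ValueError("need at least one judged attempt")
    # True counts as 1 and False as 0, so the sum is the number of successes.
    return sum(judgments) / len(judgments)


# Illustrative labels only: 97 of 100 attempts judged successful.
labels = [True] * 97 + [False] * 3
rate = attack_success_rate(labels)
print(f"{rate:.0%}")  # prints "97%"
```

Because the result depends entirely on how attempts are judged, rates from different papers are comparable only when the judging procedure matches.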

Abstract

The safety of large language models (LLMs) has garnered significant research attention. In this paper, we argue that previous empirical studies demonstrate that LLMs tend to trust information from authoritative sources, such as academic papers, which implies new possible vulnerabilities. To verify this possibility, we design a preliminary analysis that illustrates our two findings. Based on this insight, we propose a novel jailbreaking method, Paper Summary Attack (PSA). It systematically synthesizes content from either attack-focused or defense-focused LLM safety papers to construct an adversarial prompt template, while strategically infilling harmful queries as adversarial payloads within predefined subsections. Extensive experiments show significant vulnerabilities not only in base LLMs but also in state-of-the-art reasoning models like Deepseek-R1. PSA achieves a 97%…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Ethics and Social Impacts of AI