JailNewsBench: Multi-Lingual and Regional Benchmark for Fake News Generation under Jailbreak Attacks
Masahiro Kaneko, Ayana Niwa, Timothy Baldwin

TL;DR
JailNewsBench is a comprehensive multi-lingual, regional benchmark designed to evaluate large language models' robustness against jailbreak attacks that induce fake news generation, revealing significant safety gaps across languages and regions.
Contribution
It introduces the first multi-lingual, regional benchmark for assessing LLM resilience to jailbreak-induced fake news, covering 34 regions, 22 languages, and multiple attack types.
Findings
Maximum attack success rate reached 86.3%.
Safety performance is significantly lower for English and U.S.-related topics.
Existing safety datasets have limited coverage of fake news categories.
Abstract
Fake news undermines societal trust and decision-making across politics, economics, health, and international relations, and in extreme cases threatens human lives and societal safety. Because fake news reflects region-specific political, social, and cultural contexts and is expressed in language, evaluating the risks of large language models (LLMs) requires a multi-lingual and regional perspective. Malicious users can bypass safeguards through jailbreak attacks, inducing LLMs to generate fake news. However, no benchmark currently exists to systematically assess attack resilience across languages and regions. Here, we propose JailNewsBench, the first benchmark for evaluating LLM robustness against jailbreak-induced fake news generation. JailNewsBench spans 34 regions and 22 languages, covering 8 evaluation sub-metrics through LLM-as-a-Judge and 5 jailbreak attacks, with approximately…
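As a rough illustration of how the headline numbers of such a benchmark could be tabulated, the sketch below computes an attack success rate and a mean judge score per region. It is not the authors' evaluation code. The record fields (`region`, `language`, `refused`, `sub_scores`), the definition of success as "the model did not refuse", and the unweighted averaging of sub-metric scores are all assumptions made for this example; the summary above does not specify how the paper defines or aggregates them.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class JudgedResponse:
    """One model response to one jailbreak prompt (hypothetical schema)."""
    region: str
    language: str
    refused: bool      # assumption: True if the model declined to answer
    sub_scores: tuple  # assumption: judge scores on a 1-5 scale; empty if refused


def attack_success_rate(records):
    """Fraction of attempts that were not refused (assumed definition)."""
    if not records:
        raise ValueError("no records to score")
    return sum(not r.refused for r in records) / len(records)


def mean_harmfulness(records):
    """Unweighted mean of sub-metric scores over non-refused responses.

    Returns 0.0 when no response was scored. Equal weighting is an
    assumption of this sketch, not a documented aggregation rule.
    """
    scored = [mean(r.sub_scores) for r in records
              if not r.refused and r.sub_scores]
    return mean(scored) if scored else 0.0


def summarize_by_region(records):
    """Map each region to (attack success rate, mean harmfulness)."""
    groups = defaultdict(list)
    for r in records:
        groups[r.region].append(r)
    return {region: (attack_success_rate(g), mean_harmfulness(g))
            for region, g in groups.items()}
```

Grouping by `language` instead of `region` would only require changing the key used in `summarize_by_region`, which is one way to compare safety across languages as well as regions.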
Peer Reviews
Decision: ICLR 2026 Poster
1. This paper systematically studies jailbreak attacks that induce LLMs to generate fake news, addressing an important and underexplored safety gap in current research. 2. The proposed JailNewsBench spans multiple regions and languages, making it more comprehensive and representative than existing jailbreak benchmarks. 3. The authors conduct extensive experiments across different LLMs, including both black-box models and open-source ones, using four malicious motivations, five jailbreak techniques, a
1. The paper does not provide a grouped analysis of attack performance under different malicious motivations. It would be interesting to see whether current LLMs exhibit varying sensitivity to these different intent types. 2. The assessment relies on LLMs to evaluate outputs, which may introduce circular or model-specific bias. 3. Considering that fake news generation is a highly sensitive and potentially harmful topic, the paper would benefit from a more detailed discussion on ethical safeguard
* Clear societal risk focus, broad coverage. The benchmark targets jailbreak-induced fake-news generation, spanning 34 regions/22 languages and ~300k instances, which is substantially broader than typical English-centric setups. * Documented selection guardrails. The paper explicitly specifies region inclusion criteria (exclude places with special fake-news laws, high instability, or very recent news) to reduce release risk.
LLMs’ vulnerability in misinformation-related contexts is a well-established finding. This paper revisits the issue through two main empirical observations: (1) LLMs can be induced to generate misinformation through prompting, and (2) they struggle to detect LLM-generated fake news when relying solely on their internal knowledge and without access to external evidence. However, both observations have been widely reported in prior work: [1] involves the first point, [1-3] involves the second poin
<1> A large-scale, multilingual, multi-region benchmark specifically targeting jailbreak-induced fake news generation, covering diverse geographies/languages, seed rationales, and attack strategies, and supporting both black-box and white-box evaluations. <2> An interpretable scoring framework with eight sub-metrics on a 5-point scale that moves beyond a single aggregate score; it improves reliability and diagnostic power, and is validated against alternative judges/human annotations. <3> Syst
<1> The relationships among sub-metrics and their aggregation method require further clarification. The paper proposes eight evaluation dimensions, but does not provide detailed explanations regarding the independence among these metrics, how weights are assigned, or how the final overall score is computed. Some metrics may be highly correlated (e.g., "Verifiability" and "Faithfulness"), which could affect the efficiency and interpretability of the evaluation. <2> The credibility of the evaluat
Taxonomy
Topics: Misinformation and Its Impacts · Hate Speech and Cyberbullying Detection · Spam and Phishing Detection
