Prompt Injection Vulnerability of Consensus Generating Applications in Digital Democracy
Jairo Gudiño-Rosero, Clément Contet, Umberto Grandi, César A. Hidalgo

TL;DR
This paper investigates the vulnerability of consensus-generating Large Language Models in digital democracy to prompt-injection attacks and proposes a robustness pipeline to mitigate these risks.
Contribution
It identifies specific vulnerabilities of off-the-shelf LLMs in consensus tasks and introduces a defense framework combining detection, structured opinions, and reinforcement learning.
Findings
Default LLMs are highly vulnerable to prompt-injection attacks.
The proposed robustness pipeline significantly reduces consensus shifts caused by attacks.
Vulnerabilities are especially pronounced when opinions are closely balanced.
Abstract
Large Language Models (LLMs) are gaining traction as a method to generate consensus statements and aggregate preferences in digital democracy experiments. Yet, LLMs could introduce critical vulnerabilities in these systems. Here, we examine the vulnerability and robustness of off-the-shelf consensus-generating LLMs to prompt-injection attacks, in which texts are injected to amplify particular viewpoints, erase certain opinions, or divert consensus toward unrelated or irrelevant topics. We construct attack-free and adversarial variants of prompts containing public policy questions and opinion texts, classify opinion and consensus valences with a fine-tuned BERT model, and estimate LLM-human majority agreement rates. Across topics, default LLaMA 3.1 8B Instruct, GPT-4.1 Nano, and Apertus 8B exhibit widespread vulnerability, especially when agreement and disagreement are finely balanced,…
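As a rough illustration of the "LLM-human majority agreement rate" mentioned in the abstract, the sketch below compares the valence of a generated consensus with the majority valence of the human opinions, once for attack-free prompts and once for adversarial ones. This is not the authors' code: the function names, the `"agree"`/`"disagree"` labels, the tie handling, and the toy data are all assumptions made for illustration, and the valence labels are taken as already given rather than produced by a fine-tuned BERT classifier.

```python
from collections import Counter


def majority_valence(opinions):
    """Return the most common valence label, or None if empty or tied.

    Hypothetical helper: labels are assumed to be pre-classified strings.
    """
    counts = Counter(opinions).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no clear majority
    return counts[0][0]


def agreement_rate(topics, consensus_key):
    """Fraction of topics with a clear human majority whose consensus
    valence (stored under consensus_key) matches that majority."""
    matches, total = 0, 0
    for topic in topics:
        majority = majority_valence(topic["opinions"])
        if majority is None:
            continue  # skip topics without a clear majority
        total += 1
        matches += topic[consensus_key] == majority
    return matches / total if total else float("nan")


# Toy, made-up data: "clean" is the consensus valence without an attack,
# "attacked" is the valence after a hypothetical injected prompt.
topics = [
    {"opinions": ["agree", "agree", "disagree"],
     "clean": "agree", "attacked": "agree"},
    {"opinions": ["agree", "disagree", "disagree"],
     "clean": "disagree", "attacked": "agree"},
    {"opinions": ["agree", "agree", "agree", "disagree"],
     "clean": "agree", "attacked": "agree"},
    {"opinions": ["disagree", "disagree", "agree"],
     "clean": "agree", "attacked": "agree"},
]

clean_rate = agreement_rate(topics, "clean")
attacked_rate = agreement_rate(topics, "attacked")
print(f"clean: {clean_rate:.2f}, attacked: {attacked_rate:.2f}")
```

In this toy setup, a drop from the clean rate to the attacked rate would indicate that the injected text pulled the consensus away from the human majority; the narrowly split topics are the ones where a single shifted consensus changes the outcome.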
