Proactive Defense: Compound AI for Detecting Persuasion Attacks and Measuring Inoculation Effectiveness

Svitlana Volkova; Will Dupree; Hsien-Te Kao; Peter Bautista; Gabe Ganberg; Jeff Beaubien; Laura Cassani

arXiv:2511.21749 · cs.CL · December 1, 2025

Open Access

TL;DR

This paper presents BRIES, a comprehensive AI system for detecting persuasion attacks, evaluating inoculation strategies, and analyzing vulnerabilities across language models, advancing AI safety and cognitive security.

Contribution

Introduces BRIES, a novel compound AI architecture with specialized agents for persuasion attack detection, inoculation, and causal analysis, enhancing understanding of AI vulnerabilities and resilience strategies.

Findings

01. GPT-4 outperforms open-source models in detection accuracy.

02. Detection performance varies significantly across language models.

03. Prompt engineering impacts detection efficacy and model-specific performance.

Abstract

This paper introduces BRIES, a novel compound AI architecture designed to detect and measure the effectiveness of persuasion attacks across information environments. We present a system with four specialized agents: a Twister that generates adversarial content employing targeted persuasion tactics, a Detector that identifies attack types with configurable parameters, a Defender that creates resilient content through content inoculation, and an Assessor that employs causal inference to evaluate inoculation effectiveness. Experimenting with the SemEval 2023 Task 3 taxonomy on a synthetic persuasion dataset, we demonstrate significant variations in detection performance across language agents. Our comparative analysis reveals marked performance disparities, with GPT-4 achieving superior detection accuracy on complex persuasion techniques, while open-source models like Llama3 and…
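
The four-agent flow described in the abstract can be pictured with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the `Detection` type, the technique labels, and the keyword cues are hypothetical, and simple string rules stand in for the language models a real system would call. The `assessor` uses a naive difference in means as a placeholder for effect estimation; it is not the causal inference method used in the paper, which this listing does not describe.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List

# All names and labels below are hypothetical; the source specifies no API.


@dataclass
class Detection:
    technique: str  # label assigned by the detector (illustrative only)
    score: float    # detector confidence in [0, 1]


def twister(text: str, technique: str) -> str:
    """Stand-in for the Twister: tags text with the tactic it should employ.

    A real system would ask a language model to rewrite the text.
    """
    return f"[{technique}] {text}"


def detector(text: str, threshold: float = 0.5) -> List[Detection]:
    """Stand-in for the Detector: toy cue matching with a score threshold."""
    cues = {"loaded_language": "outrageous", "appeal_to_fear": "danger"}
    lowered = text.lower()
    hits = [Detection(name, 1.0) for name, cue in cues.items() if cue in lowered]
    return [d for d in hits if d.score >= threshold]


def defender(text: str, detections: List[Detection]) -> str:
    """Stand-in for the Defender: prepends a warning naming detected tactics."""
    if not detections:
        return text
    names = ", ".join(d.technique for d in detections)
    return f"Note: this text may use {names}. {text}"


def assessor(treated: List[float], control: List[float]) -> float:
    """Naive effect estimate: mean treated outcome minus mean control outcome."""
    return mean(treated) - mean(control)


if __name__ == "__main__":
    attack = twister("This is an outrageous danger.", "loaded_language")
    found = detector(attack)
    print(defender(attack, found))
    # Hypothetical persuasion scores for inoculated vs. non-inoculated readers.
    print(assessor([0.2, 0.4], [0.6, 0.8]))
```

In this sketch a negative value from `assessor` would mean the inoculated group was persuaded less than the control group; the scores passed in are made-up numbers, not results from the paper.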

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Misinformation and Its Impacts · Ethics and Social Impacts of AI