Proactive Defense: Compound AI for Detecting Persuasion Attacks and Measuring Inoculation Effectiveness
Svitlana Volkova, Will Dupree, Hsien-Te Kao, Peter Bautista, Gabe Ganberg, Jeff Beaubien, Laura Cassani

TL;DR
This paper presents BRIES, a comprehensive AI system for detecting persuasion attacks, evaluating inoculation strategies, and analyzing vulnerabilities across language models, advancing AI safety and cognitive security.
Contribution
Introduces BRIES, a novel compound AI architecture with specialized agents for persuasion attack detection, inoculation, and causal analysis, enhancing understanding of AI vulnerabilities and resilience strategies.
Findings
GPT-4 outperforms open-source models in detection accuracy.
Detection performance varies significantly across language models.
Prompt engineering impacts detection efficacy and model-specific performance.
Abstract
This paper introduces BRIES, a novel compound AI architecture designed to detect and measure the effectiveness of persuasion attacks across information environments. We present a system with specialized agents: a Twister that generates adversarial content employing targeted persuasion tactics, a Detector that identifies attack types with configurable parameters, a Defender that creates resilient content through content inoculation, and an Assessor that employs causal inference to evaluate inoculation effectiveness. Experimenting with the SemEval 2023 Task 3 taxonomy across the synthetic persuasion dataset, we demonstrate significant variations in detection performance across language agents. Our comparative analysis reveals significant performance disparities with GPT-4 achieving superior detection accuracy on complex persuasion techniques, while open-source models like Llama3 and…
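The four-agent flow described in the abstract can be sketched as a simple pipeline. This is only an illustrative sketch, not the authors' implementation: the `Message` type, the function names, the keyword-based detector, and the technique labels are all hypothetical stand-ins (the real system uses language models), and the Assessor here is reduced to a difference in mean outcomes, which is a valid effect estimate only if inoculation is randomly assigned.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Message:
    text: str
    technique: str = "none"   # hypothetical persuasion-technique label
    inoculated: bool = False  # set by the defender stand-in


def twister(text: str, technique: str) -> Message:
    """Stand-in for the Twister: prepends a cue phrase for the chosen technique."""
    prefixes = {
        "loaded_language": "Shocking truth: ",
        "appeal_to_fear": "Be afraid: ",
    }
    return Message(prefixes.get(technique, "") + text, technique)


def detector(msg: Message) -> str:
    """Stand-in for the Detector: a keyword lookup, NOT a language-model call."""
    lowered = msg.text.lower()
    if "shocking" in lowered:
        return "loaded_language"
    if "afraid" in lowered:
        return "appeal_to_fear"
    return "none"


def defender(msg: Message, detected: str) -> Message:
    """Stand-in for the Defender: prepends a warning when a technique is detected."""
    if detected == "none":
        return msg
    warning = f"[Note: this text may use {detected}.] "
    return Message(warning + msg.text, msg.technique, inoculated=True)


def assessor(treated: List[float], control: List[float]) -> float:
    """Stand-in for the Assessor: difference in mean outcome scores.

    Assumes random assignment to the inoculated (treated) and control groups.
    """
    return mean(treated) - mean(control)


# Example run with made-up persuasion scores (lower means less persuaded).
attack = twister("Prices rose last month.", "appeal_to_fear")
label = detector(attack)
defended = defender(attack, label)
effect = assessor(treated=[0.2, 0.3, 0.25], control=[0.6, 0.5, 0.55])
```

In this toy run the scores are invented purely to exercise the code; a negative `effect` would indicate that the inoculated group was less persuaded than the control group.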
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Misinformation and Its Impacts · Ethics and Social Impacts of AI
