Agentic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines
Mazal Bethany, Kim-Kwang Raymond Choo, Nishant Vishwamitra, and Peyman Najafirad

TL;DR
This paper introduces a black-box adversarial attack framework for NLP pipelines. It shows that vulnerability to meaning-preserving rewrites under strict query limits depends on pipeline architecture, and it proposes defenses that improve robustness.
Contribution
It formalizes a realistic black-box threat model and develops a two-agent framework that effectively evades NLP systems, highlighting how architecture influences vulnerability.
Findings
Evasion rates of 19.95% to 40.34% against modern LLM-based pipelines.
Legacy static retrieval systems are almost completely vulnerable, with 97.02% evasion.
A pattern-informed defense reduces evasion by up to 65.18%.
Abstract
Multi-component natural language processing (NLP) pipelines are increasingly deployed for high-stakes decisions, yet no existing adversarial method can test their robustness under realistic conditions: binary-only feedback, no gradient access, and strict query budgets. We formalize this strict black-box threat model and propose a two-agent evasion framework operating in a semantic perturbation space. An Attacker Agent generates meaning-preserving rewrites while a Prompt Optimization Agent refines the attack strategy using only binary decision feedback within a 10-query budget. Evaluated against four evidence-based misinformation detection pipelines, the framework achieves evasion rates of 19.95% to 40.34% on modern large language model (LLM)-based systems, compared to at most 3.90% for token-level perturbation baselines that rely on surrogate models because they cannot operate under our…
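The query-budgeted loop described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the names `run_attack`, `AttackResult`, and the three toy functions are hypothetical, the two agents (LLM-driven in the paper) are replaced by trivial string functions, and the keyword-matching "pipeline" stands in for a real detection system. The sketch also assumes the attacker's rewrites preserve meaning and does not check that. The only parts taken from the abstract are the structure: one binary decision per query, no gradients, and a hard cap of 10 queries.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical type aliases; none of these names come from the paper.
Pipeline = Callable[[str], bool]   # True = flagged, False = evaded
Attacker = Callable[[str, str], str]   # (text, strategy) -> rewrite
Optimizer = Callable[[str, List[Tuple[str, bool]]], str]   # (strategy, history) -> strategy


@dataclass
class AttackResult:
    evaded: bool
    queries_used: int
    rewrite: Optional[str]


def run_attack(text: str, pipeline: Pipeline, attacker: Attacker,
               optimizer: Optimizer, initial_strategy: str,
               budget: int = 10) -> AttackResult:
    """Alternate rewrite and strategy-refinement steps under a query budget."""
    strategy = initial_strategy
    history: List[Tuple[str, bool]] = []
    for query in range(1, budget + 1):
        candidate = attacker(text, strategy)
        flagged = pipeline(candidate)   # the only feedback: one bit per query
        if not flagged:
            return AttackResult(True, query, candidate)
        history.append((candidate, flagged))
        strategy = optimizer(strategy, history)   # refine using binary outcomes only
    return AttackResult(False, budget, None)


# Toy stand-ins (assumptions for illustration, not real agents or detectors).
STRATEGIES = ["reorder", "hedge", "paraphrase"]


def toy_pipeline(text: str) -> bool:
    return "miracle" in text.lower()


def toy_attacker(text: str, strategy: str) -> str:
    if strategy == "reorder":
        words = text.split()
        return " ".join(words[1:] + words[:1])
    if strategy == "hedge":
        return "Reportedly, " + text
    return text.replace("miracle cure", "remarkable remedy")


def toy_optimizer(strategy: str, history: List[Tuple[str, bool]]) -> str:
    # Move to the next strategy after a failed attempt.
    return STRATEGIES[(STRATEGIES.index(strategy) + 1) % len(STRATEGIES)]


result = run_attack("This miracle cure works", toy_pipeline,
                    toy_attacker, toy_optimizer, "reorder")
print(result)
```

In this toy run the first two strategies leave the trigger word in place and are flagged, so the loop succeeds on the third query. A real Prompt Optimization Agent would presumably rewrite the attacker's instructions rather than cycle through a fixed list; the cycling here only keeps the example self-contained.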
