Agentic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines

Mazal Bethany; Kim-Kwang Raymond Choo; Nishant Vishwamitra; and Peyman Najafirad

arXiv:2604.23483 · cs.AI · April 28, 2026

TL;DR

This paper introduces a black-box adversarial attack framework for multi-component NLP pipelines. Using meaning-preserving rewrites under a strict query budget, it exposes architectural vulnerabilities in these pipelines and proposes defenses that improve their robustness.

Contribution

The paper formalizes a realistic black-box threat model (binary-only feedback, no gradient access, strict query budgets) and develops a two-agent framework that evades NLP detection pipelines, showing how pipeline architecture influences vulnerability.

Findings

01. Against modern LLM-based pipelines, the framework achieves evasion rates of 19.95% to 40.34%.

02. Legacy static retrieval systems are almost completely vulnerable, with 97.02% evasion.

03. A pattern-informed defense reduces evasion by up to 65.18%.
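The findings above are stated as evasion rates and as a reduction in evasion, but this summary does not define either quantity. As a minimal sketch, assume the evasion rate is the percentage of attacked inputs for which some rewrite went undetected. The function names are hypothetical, and the numbers in the usage lines are made-up toy values, not results from the paper. The sketch also shows the two common ways to read "reduces evasion by X%", since the summary does not say which one is meant.

```python
from typing import Sequence


def evasion_rate(outcomes: Sequence[bool]) -> float:
    """Percentage of attacked inputs whose rewrite evaded detection.

    Assumed definition: outcomes[i] is True if input i evaded.
    """
    if not outcomes:
        raise ValueError("need at least one attacked input")
    return 100.0 * sum(outcomes) / len(outcomes)


def absolute_reduction(before: float, after: float) -> float:
    """Drop in evasion rate, in percentage points."""
    return before - after


def relative_reduction(before: float, after: float) -> float:
    """Drop in evasion rate, as a percentage of the undefended rate."""
    if before <= 0:
        raise ValueError("undefended evasion rate must be positive")
    return 100.0 * (before - after) / before


# Toy values for illustration only (not taken from the paper).
undefended = evasion_rate([True, True, False, False, False])   # 2 of 5 evade
defended = evasion_rate([True, False, False, False, False])    # 1 of 5 evades
print(absolute_reduction(undefended, defended))  # percentage points
print(relative_reduction(undefended, defended))  # percent of the baseline
```

The two readings can differ substantially for the same pair of rates, so any comparison against the reported 65.18% figure should first check which definition the full paper uses.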

Abstract

Multi-component natural language processing (NLP) pipelines are increasingly deployed for high-stakes decisions, yet no existing adversarial method can test their robustness under realistic conditions: binary-only feedback, no gradient access, and strict query budgets. We formalize this strict black-box threat model and propose a two-agent evasion framework operating in a semantic perturbation space. An Attacker Agent generates meaning-preserving rewrites while a Prompt Optimization Agent refines the attack strategy using only binary decision feedback within a 10-query budget. Evaluated against four evidence-based misinformation detection pipelines, the framework achieves evasion rates of 19.95 to 40.34% on modern large language model (LLM) based systems, compared to at most 3.90% for token-level perturbation baselines that rely on surrogate models because they cannot operate under our…
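The attack loop described in the abstract can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the names `run_attack`, `AttackResult`, and the three toy callables are hypothetical, the agents are stand-in string functions rather than LLM calls, and the sketch assumes that each call to the target pipeline costs one query. It does not model any check that a rewrite preserves meaning.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

QUERY_BUDGET = 10  # per-input query limit stated in the abstract

# Hypothetical type aliases for the three components.
Target = Callable[[str], bool]          # True means "flagged" (binary feedback only)
Attacker = Callable[[str, str], str]    # (original text, strategy) -> rewrite
Optimizer = Callable[[str, List[Tuple[str, bool]]], str]  # -> refined strategy


@dataclass
class AttackResult:
    evaded: bool
    queries_used: int
    rewrite: Optional[str]
    history: List[Tuple[str, bool]] = field(default_factory=list)


def run_attack(text: str, target: Target, attacker: Attacker,
               optimizer: Optimizer, initial_strategy: str,
               budget: int = QUERY_BUDGET) -> AttackResult:
    """Try up to `budget` rewrites, refining the strategy after each failure."""
    strategy = initial_strategy
    history: List[Tuple[str, bool]] = []
    for query in range(1, budget + 1):
        candidate = attacker(text, strategy)
        flagged = target(candidate)  # the only signal available: one bit
        history.append((candidate, flagged))
        if not flagged:
            return AttackResult(True, query, candidate, history)
        # No gradients or scores: the optimizer sees only past binary outcomes.
        strategy = optimizer(strategy, history)
    return AttackResult(False, budget, None, history)


# Toy stand-ins so the sketch runs end to end (assumptions, not real agents).
STRATEGIES = ["keep wording", "paraphrase key claim"]


def toy_target(text: str) -> bool:
    return "miracle cure" in text.lower()


def toy_attacker(text: str, strategy: str) -> str:
    if strategy == "paraphrase key claim":
        return text.replace("miracle cure", "remarkably effective remedy")
    return text


def toy_optimizer(strategy: str, history: List[Tuple[str, bool]]) -> str:
    # Move to the next strategy after a failed attempt; stay on the last one.
    index = STRATEGIES.index(strategy)
    return STRATEGIES[min(index + 1, len(STRATEGIES) - 1)]


result = run_attack("This herb is a miracle cure.", toy_target,
                    toy_attacker, toy_optimizer, STRATEGIES[0])
print(result.evaded, result.queries_used, result.rewrite)
```

Keeping the target behind a single boolean-returning callable mirrors the threat model in the abstract: the loop never reads scores, gradients, or intermediate pipeline outputs, and it stops as soon as the budget is spent.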
