WARP: Guaranteed Inner-Layer Repair of NLP Transformers
Hsin-Ling Hsu, Min-Yu Chen, Nai-Chia Chen, Yan-Ru Chen, Yi-Ling Chang, Fang Yu

TL;DR
WARP is a repair framework with provable guarantees for Transformer NLP models. It extends repair beyond the last layer by posing repair as a convex optimization problem, and its robustness and correctness guarantees hold when a first-order approximation of the model is valid.
Contribution
WARP introduces a constraint-based, convex quadratic programming approach for guaranteed, high-dimensional repair of Transformer models beyond the final layer.
Findings
WARP achieves practical robustness guarantees on encoder-only Transformers.
The method extends repair capabilities to multiple layers, not just the last.
Empirical results show improved adversarial robustness while maintaining correctness.
Abstract
Transformer-based NLP models remain vulnerable to adversarial perturbations, yet existing repair methods face a fundamental trade-off: gradient-based approaches offer flexibility but lack verifiability and often overfit; methods that do provide repair guarantees are restricted to the final layer or small networks, significantly limiting the parameter search space available for repair. We present WARP (Weight-Adjusted Repair with Provability), a constraint-based repair framework that extends repair beyond the last layer of Transformer models. WARP formulates repair as a convex quadratic program derived from a first-order linearization of the logit gap, enabling tractable optimization over a high-dimensional parameter space. Under the condition that the first-order approximation holds, this formulation induces three per-sample guarantees: (i) a positive margin constraint ensuring correct…
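To make the formulation concrete, here is a minimal sketch of the margin-constrained repair step for a single sample. It is not the authors' implementation. A toy two-layer tanh network stands in for a Transformer, the repaired parameters are the inner weight matrix `W1`, and every name (`logit_gap`, `gap_gradient`, `min_norm_repair`, `margin`) is hypothetical. With one linearized constraint, the quadratic program min ½‖Δ‖² subject to gap₀ + ⟨J, Δ⟩ ≥ margin has a closed-form solution, so no solver is needed. Only the margin constraint visible in the abstract is covered.

```python
import numpy as np

def logit_gap(W1, W2, x, y, other):
    """Gap between the true-class logit and a competing logit."""
    h = np.tanh(W1 @ x)
    logits = W2 @ h
    return logits[y] - logits[other]

def gap_gradient(W1, W2, x, y, other):
    """Analytic gradient of the logit gap with respect to the inner weights W1."""
    a = W1 @ x
    v = W2[y] - W2[other]
    return np.outer(v * (1.0 - np.tanh(a) ** 2), x)

def min_norm_repair(gap0, J, margin):
    """Solve min 0.5*||D||^2 s.t. gap0 + <J, D> >= margin (single constraint)."""
    shortfall = margin - gap0
    if shortfall <= 0.0:
        # The constraint already holds, so the smallest update is no update.
        return np.zeros_like(J)
    # KKT solution: move along J just far enough to reach the margin.
    return (shortfall / np.sum(J * J)) * J

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # inner-layer weights to be repaired
W2 = rng.normal(size=(2, 4))   # output layer, kept fixed
x = rng.normal(size=3)

# Assumption for the demo: treat the currently losing class as the true label,
# so the sample is misclassified and needs repair.
y = int(np.argmin(W2 @ np.tanh(W1 @ x)))
other = 1 - y
margin = 0.5

gap0 = logit_gap(W1, W2, x, y, other)
J = gap_gradient(W1, W2, x, y, other)
delta = min_norm_repair(gap0, J, margin)

linearized_gap = gap0 + np.sum(J * delta)               # meets the margin by construction
repaired_gap = logit_gap(W1 + delta, W2, x, y, other)   # true gap after the update
print(f"before: {gap0:.4f}  linearized: {linearized_gap:.4f}  actual: {repaired_gap:.4f}")
```

The linearized gap reaches the margin by construction, while the actual gap after the update can differ from it. That difference is why the guarantees are stated under the condition that the first-order approximation holds. Repairing many samples at once adds one linear constraint per sample, which in general requires a QP solver rather than a closed form.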
