Monotonicity as an Architectural Bias for Robust Language Models
Patrick Cooper, Alireza Nadali, Ashutosh Trivedi, Alvaro Velasquez

TL;DR
This paper introduces a selective monotonicity bias into Transformer-based language models, substantially improving robustness to adversarial attacks while maintaining performance on standard tasks.
Contribution
The paper demonstrates that enforcing monotonicity in specific model components improves robustness without sacrificing expressivity or accuracy.
Findings
Adversarial attack success rate drops from 69% to 19%.
Monotonic models maintain performance on standard tasks.
Selective monotonicity enhances robustness with minimal performance loss.
Abstract
Large language models (LLMs) are known to exhibit brittle behavior under adversarial prompts and jailbreak attacks, even after extensive alignment and fine-tuning. This fragility reflects a broader challenge of modern neural language models: small, carefully structured perturbations in high-dimensional input spaces can induce large and unpredictable changes in internal semantic representations and output. We investigate monotonicity as an architectural inductive bias for improving the robustness of Transformer-based language models. Monotonicity constrains semantic transformations so that strengthening information, evidence, or constraints cannot lead to regressions in the corresponding internal representations. Such order-preserving behavior has long been exploited in control and safety-critical systems to simplify reasoning and improve robustness, but has traditionally been viewed…
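The order-preserving property described in the abstract can be illustrated with a small sketch. The excerpt above does not say how the paper enforces the constraint or which Transformer components it targets, so the following is only one common construction, assumed here for illustration: a dense layer whose effective weights are kept non-negative through a softplus reparameterization, followed by a non-decreasing activation (tanh). The names `MonotoneLayer` and `monotone_mlp` are hypothetical and are not taken from the paper.

```python
import math
import random


def softplus(z):
    # Numerically stable softplus; the result is always non-negative.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


class MonotoneLayer:
    """Dense layer with non-negative effective weights (illustrative assumption,
    not the paper's stated method)."""

    def __init__(self, n_in, n_out, seed=0):
        rng = random.Random(seed)
        # Raw parameters are unconstrained; softplus maps them to weights >= 0.
        self.raw_w = [[rng.gauss(0.0, 1.0) for _ in range(n_in)]
                      for _ in range(n_out)]
        self.bias = [rng.gauss(0.0, 1.0) for _ in range(n_out)]

    def __call__(self, x):
        out = []
        for row, b in zip(self.raw_w, self.bias):
            # Non-negative weights make the pre-activation non-decreasing
            # in every input coordinate.
            pre = b + sum(softplus(w) * xi for w, xi in zip(row, x))
            # tanh is non-decreasing, so the composition stays monotone.
            out.append(math.tanh(pre))
        return out


def monotone_mlp(x, layers):
    # A composition of monotone maps is itself monotone.
    for layer in layers:
        x = layer(x)
    return x


layers = [MonotoneLayer(4, 8, seed=1), MonotoneLayer(8, 3, seed=2)]

x_lo = [0.1, -0.5, 0.3, 0.0]
x_hi = [0.2, -0.5, 0.9, 0.4]  # every coordinate is >= the one in x_lo

y_lo = monotone_mlp(x_lo, layers)
y_hi = monotone_mlp(x_hi, layers)

# Increasing the input coordinate-wise never decreases any output coordinate.
order_preserved = all(a <= b for a, b in zip(y_lo, y_hi))
```

Under this construction, `order_preserved` holds for any pair of inputs ordered coordinate-wise, regardless of the raw parameter values. A selective scheme, as the summary describes, would apply such a constraint only to chosen sub-layers and leave the rest of the model unconstrained.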
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
