Unifying Adversarial Robustness and Training Across Text Scoring Models

Manveer Singh Tamber; Hosna Oyarhoseini; Jimmy Lin

arXiv:2602.00857·cs.CL·February 3, 2026

Open Access

TL;DR

This paper unifies the study of adversarial robustness across text scoring models (dense retrievers, rerankers, and reward models) and proposes new adversarial training methods that improve robustness and alignment in language models, especially in RLHF applications.

Contribution

It introduces a unified framework for adversarial robustness in text scoring models and develops new training methods that enhance robustness and task effectiveness.

Findings

01. Adversarial training methods improve robustness across models.

02. Combining training methods yields better generalization.

03. Adversarially trained reward models reduce reward hacking.

Abstract

Research on adversarial robustness in language models is currently fragmented across applications and attacks, obscuring shared vulnerabilities. In this work, we propose unifying the study of adversarial robustness in text scoring models spanning dense retrievers, rerankers, and reward models. This motivates adapting both attacks and adversarial training methods across model roles. Unlike open-ended generation, text scoring failures are directly testable: an attack succeeds when an irrelevant or rejected text outscores a relevant or chosen one. Using this principled lens of text scoring, we demonstrate that current adversarial training formulations for language models are often short-sighted, failing to effectively generalize across attacks. To address this, we introduce multiple adversarial training methods for text scoring models and show that combining complementary training methods…
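
The success criterion stated in the abstract (an attack succeeds when an irrelevant or rejected text outscores a relevant or chosen one) can be written as a direct comparison of two scores. The following is a minimal sketch of that check, not the paper's implementation: `toy_score` is a hypothetical token-overlap scorer standing in for a dense retriever, reranker, or reward model, and the names `attack_succeeds` and `attack_success_rate` are assumptions introduced here for illustration.

```python
from typing import Callable, Iterable, Tuple

# Any text scoring model can be viewed as a function (query_or_prompt, text) -> score.
ScoreFn = Callable[[str, str], float]


def toy_score(query: str, text: str) -> float:
    """Hypothetical stand-in scorer: fraction of query tokens found in the text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & text_tokens) / len(query_tokens)


def attack_succeeds(score: ScoreFn, query: str, preferred: str, adversarial: str) -> bool:
    """True when the adversarial text strictly outscores the preferred text."""
    return score(query, adversarial) > score(query, preferred)


def attack_success_rate(score: ScoreFn, cases: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of (query, preferred, adversarial) cases in which the attack succeeds."""
    cases = list(cases)
    if not cases:
        return 0.0
    wins = sum(attack_succeeds(score, q, pos, adv) for q, pos, adv in cases)
    return wins / len(cases)


if __name__ == "__main__":
    cases = [
        # Keyword stuffing: an irrelevant text that copies the query terms.
        ("capital of france", "Paris is the French capital",
         "capital of france cheap pills here"),
        # Irrelevant text with no query overlap.
        ("capital of france", "Paris is the French capital", "cheap pills here"),
    ]
    print(attack_success_rate(toy_score, cases))
```

Because the check only needs a scoring function and a pair of texts, the same test applies unchanged whether the scorer is a retriever, a reranker, or a reward model, which is what makes failures directly testable across model roles.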

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection