Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI

Aravind Cheruvu, Shravya Kanchi, Sifat Muhammad Abdullah, Nicholas Ka-Shing Kong, Daphne Yao, Murtuza Jadliwala, and Bimal Viswanath

arXiv:2507.05660·cs.CR·May 22, 2026

TL;DR

Optimus is a defense framework that mitigates toxicity in fine-tuned LLMs, even when its toxicity classifier is biased, by combining a training-free toxicity classification scheme with a dual-strategy alignment process (synthetic "healing data" plus Direct Preference Optimization).

Contribution

Optimus introduces a training-free toxicity classification method that repurposes the safety alignment of commodity LLMs, and a dual-strategy alignment process that steers fine-tuned conversational AI models toward safety.
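One way to picture a training-free classifier that leans on an aligned model's safety behavior is to treat a refusal as a toxicity signal. The sketch below illustrates that general idea only; it is not taken from the paper. The refusal markers, the prompt wording, and the names `looks_like_refusal`, `is_toxic`, and `fake_llm` are all hypothetical, and `fake_llm` is a stand-in for a real model call.

```python
from typing import Callable

# Hypothetical refusal phrases (an assumption, not from the paper).
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable", "i won't")


def looks_like_refusal(response: str) -> bool:
    """Return True if the response contains a known refusal phrase."""
    text = response.strip().lower().replace("\u2019", "'")
    return any(marker in text for marker in REFUSAL_MARKERS)


def is_toxic(sample: str, generate: Callable[[str], str]) -> bool:
    """Flag a sample as toxic when the aligned model refuses to engage with it.

    `generate` is any callable that maps a prompt to a model response.
    No classifier is trained; the model's own alignment does the work.
    """
    prompt = f"Continue this conversation:\n{sample}"
    return looks_like_refusal(generate(prompt))


def fake_llm(prompt: str) -> str:
    """Stand-in for a safety-aligned LLM, used only to make the sketch runnable."""
    if "insult" in prompt:
        return "I'm sorry, but I can't help with that."
    return "Sure, here is a reply."
```

In practice `generate` would wrap a call to a commodity LLM, and simple keyword matching would likely need to be replaced with something more robust.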

Findings

01 Optimus reduces toxicity even when it relies on toxicity classifiers degraded by up to 85%.

02 It outperforms the state-of-the-art defense StarDSS.

03 It is resilient against adaptive adversarial attacks and jailbreak attacks.

Abstract

Customizing Large Language Models (LLMs) on untrusted datasets poses severe risks of injecting toxic behaviors. In this work, we introduce Optimus, a novel defense framework designed to mitigate fine-tuning harms while preserving conversational utility. Unlike existing defenses that rely heavily on precise toxicity detection or restrictive filtering, Optimus addresses the critical challenge of ensuring robust mitigation even when toxicity classifiers are imperfect or biased. Optimus integrates a training-free toxicity classification scheme that repurposes the safety alignment of commodity LLMs, and employs a dual-strategy alignment process combining synthetic "healing data" with Direct Preference Optimization (DPO) to efficiently steer models toward safety. Extensive evaluations demonstrate that Optimus mitigates toxicity even when relying on extremely biased classifiers (with up to 85%…
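For readers unfamiliar with Direct Preference Optimization, the standard per-example DPO loss can be sketched as follows. This is the generic DPO objective, not Optimus's training code. The function name `dpo_loss`, its parameter names, and the default `beta` are illustrative assumptions. In a setting like the one the abstract describes, the "chosen" response would presumably be the safe one and the "rejected" response the toxic one.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative default, not a value from the paper
) -> float:
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)

    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference, the margin is zero and the loss equals log 2. The loss falls as the policy assigns relatively more probability to the chosen response than to the rejected one.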

Code & Models

Repository: secml-lab-vt/Optimus (GitHub)
