FedDetox: Robust Federated SLM Alignment via On-Device Data Sanitization

Shunan Zhu; Jiawei Chen; Yonghao Yu; Hideya Ochiai

arXiv:2604.06833 · cs.CR · April 9, 2026

TL;DR

FedDetox is a robust federated learning framework for small language models (SLMs) that sanitizes client data on-device, so that unsafe content does not degrade the safety alignment of the global model.

Contribution

The paper presents an on-device data sanitization method for federated learning that protects safety alignment when training on resource-constrained edge devices.

Findings

01

Preserves safety alignment at a level comparable to centralized models.

02

Effectively replaces unsafe samples with safety signals.

03

Maintains model utility while enhancing safety.
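
The second finding, replacing unsafe samples with safety signals, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `PreferencePair` layout, the `SAFE_REFUSAL` text, the 0.5 threshold, the choice to keep the flagged response as the rejected side, and the keyword scorer that stands in for the distilled on-device classifier.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    """One human-preference sample held on the client (hypothetical layout)."""
    prompt: str
    chosen: str
    rejected: str


# Hypothetical refusal text used as the "safety signal".
SAFE_REFUSAL = "I can't help with that request."


def sanitize(
    samples: List[PreferencePair],
    unsafe_score: Callable[[str], float],
    threshold: float = 0.5,
) -> List[PreferencePair]:
    """Replace samples flagged as unsafe before they reach local training.

    `unsafe_score` stands in for the lightweight student classifier and
    returns a score in [0, 1]; higher means more likely unsafe.
    """
    cleaned = []
    for sample in samples:
        if unsafe_score(sample.prompt + " " + sample.chosen) >= threshold:
            # Assumed replacement rule: prefer a refusal over the unsafe reply.
            cleaned.append(
                PreferencePair(sample.prompt, SAFE_REFUSAL, sample.chosen)
            )
        else:
            cleaned.append(sample)
    return cleaned


def toy_unsafe_score(text: str) -> float:
    """Keyword placeholder for a real classifier (illustration only)."""
    blocked = ("build a bomb", "steal a password")
    return 1.0 if any(phrase in text.lower() for phrase in blocked) else 0.0


local_data = [
    PreferencePair("How do I bake bread?", "Mix flour and water.", "No idea."),
    PreferencePair("How do I build a bomb?", "To build a bomb, first", "I refuse."),
]
sanitized = sanitize(local_data, toy_unsafe_score)
```

In this sketch the benign sample passes through unchanged, while the flagged one keeps its prompt but now prefers the refusal. Raw unsafe text never needs to leave the device, since filtering happens before any local update is computed.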

Abstract

As high-quality public data becomes scarce, Federated Learning (FL) provides a vital pathway to leverage valuable private user data while preserving privacy. However, real-world client data often contains toxic or unsafe information. This leads to a critical issue we define as unintended data poisoning, which can severely damage the safety alignment of global models during federated alignment. To address this, we propose FedDetox, a robust framework tailored for Small Language Models (SLMs) on resource-constrained edge devices. We first employ knowledge distillation to transfer sophisticated safety alignment capabilities from large-scale, safety-aligned teacher models into lightweight student classifiers suitable for resource-constrained edge devices. Specifically, during federated learning for human preference alignment, the edge client identifies unsafe samples at the source and…
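
The abstract does not state which distillation objective is used. As a rough illustration of the general idea only, the sketch below uses the standard temperature-scaled KL divergence between teacher and student class distributions. The function names, the two-class (safe, unsafe) logits, and the temperature of 2.0 are assumptions, not details from the paper.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    teacher_logits: Sequence[float],
    student_logits: Sequence[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2.

    This is the classic soft-label distillation term; it is zero when the
    student reproduces the teacher's distribution exactly.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Hypothetical logits over (safe, unsafe) for a single sample.
teacher = [0.2, 3.1]
matching_student = [0.2, 3.1]
disagreeing_student = [2.5, 0.1]

loss_match = distillation_loss(teacher, matching_student)
loss_mismatch = distillation_loss(teacher, disagreeing_student)
```

Minimizing a term like this over many samples pushes a small classifier toward the teacher's safety judgments, which is what makes on-device filtering affordable in the first place.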
