Semi-Supervised Learning for Large Language Models Safety and Content Moderation

Eduard Stefan Dinuta; Iustin Sirbu; Traian Rebedea

arXiv:2512.21107 · cs.CL · December 25, 2025

Open Access

TL;DR

This paper explores semi-supervised learning methods to enhance safety and content moderation in Large Language Models, reducing reliance on labeled data and emphasizing task-specific data augmentation for better safety classifier performance.

Contribution

The paper applies semi-supervised learning techniques to safety classification for LLMs, covering both prompts and responses, and highlights the importance of task-specific augmentation over general-purpose methods.

Findings

01. Semi-supervised learning improves safety classifier accuracy.

02. Task-specific augmentation significantly boosts performance.

03. Semi-supervised learning reduces dependence on large labeled datasets.

Abstract

Safety for Large Language Models (LLMs) has been an ongoing research focus since their emergence and is even more relevant now, given the increasing capacity of these models. Currently, several guardrails are in place for all public LLMs, and multiple datasets have been proposed for training safety classifiers. However, training these safety classifiers relies on large quantities of labeled data, which can be difficult to acquire, can contain labeling errors, and often includes synthetic data. To address these issues, we suggest a different approach: using semi-supervised learning techniques, which leverage both labeled and unlabeled data, to improve performance on the safety task. We analyze the improvements that these techniques can offer for both the prompts given to LLMs and the responses to those prompts. Moreover, since augmentation is the central part of…
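
To make the idea of leveraging unlabeled data concrete, the following is a minimal sketch of confidence-thresholded pseudo-labeling, one common semi-supervised technique. It is not taken from the paper and may differ from the methods the authors evaluate. The function `select_pseudo_labels`, the toy classifier `toy_predict_proba`, the example texts, and the 0.95 threshold are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def select_pseudo_labels(
    texts: Sequence[str],
    predict_proba: Callable[[str], Sequence[float]],
    threshold: float = 0.95,
) -> List[Tuple[str, int]]:
    """Keep unlabeled texts whose top class probability meets the threshold.

    Each kept text is paired with its predicted class index, so it can be
    added to the labeled pool for the next round of training.
    """
    selected = []
    for text in texts:
        probs = predict_proba(text)
        # Index of the most probable class (first index wins on ties).
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            selected.append((text, label))
    return selected


# Hypothetical stand-in for a trained safety classifier.
# Returns [P(safe), P(unsafe)]; a real system would call a model here.
def toy_predict_proba(text: str) -> List[float]:
    if "attack" in text.lower():
        p_unsafe = 0.97
    elif "?" in text:
        p_unsafe = 0.5  # deliberately uncertain
    else:
        p_unsafe = 0.02
    return [1.0 - p_unsafe, p_unsafe]


unlabeled = [
    "How do I bake bread",
    "Plan an attack on the server",
    "Is this allowed?",
]
pseudo_labeled = select_pseudo_labels(unlabeled, toy_predict_proba)
print(pseudo_labeled)
```

In this sketch the uncertain third example is left unlabeled, while the two confident predictions are kept with class indices 0 (safe) and 1 (unsafe). A high threshold trades coverage for label quality, which matters when pseudo-labels feed back into training.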

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning · Topic Modeling