NExT-Guard: Training-Free Streaming Safeguard without Token-Level Labels
Junfeng Fang, Nachuan Chen, Houcheng Jiang, Dan Zhang, Fei Shen, Xiang Wang, Xiangnan He, Tat-Seng Chua

TL;DR
NExT-Guard is a training-free, real-time safeguard for large language models. It monitors latent features from pretrained Sparse Autoencoders (SAEs) and outperforms supervised streaming safeguards without requiring token-level labels.
Contribution
The paper introduces a training-free streaming safeguard framework that uses pretrained Sparse Autoencoders (SAEs) to detect unsafe content in real time without token-level supervision.
Findings
Outperforms supervised streaming safeguards in robustness
Works effectively across different models and risk scenarios
Enables low-cost, scalable deployment of real-time safety measures
Abstract
Large language models are increasingly deployed in streaming scenarios, rendering conventional post-hoc safeguards ineffective as they fail to interdict unsafe content in real-time. While streaming safeguards based on token-level supervised training could address this, they necessitate expensive annotations and suffer from severe overfitting. In this work, we challenge the paradigm that streaming safety must rely on token-level supervised training. Instead, it is an inherent capability of well-trained post-hoc safeguards, as they already encode token-level risk signals in hidden representations. Hence, we introduce NExT-Guard, a training-free framework that achieves streaming safeguards by monitoring interpretable latent features from Sparse Autoencoders (SAEs). It uses pretrained SAEs from publicly available base LLMs, enabling flexible, low-cost deployment without token-level…
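The monitoring idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which SAE features are monitored, how they are scored, or how a threshold is chosen, so `risk_features`, `threshold`, the max-based score, and the toy weights below are all assumptions made for illustration. The only standard piece is the SAE encoder form, ReLU(W h + b), applied to each token's hidden state as it streams in.

```python
from typing import Iterable, List, Sequence, Tuple


def sae_encode(hidden: Sequence[float],
               weights: Sequence[Sequence[float]],
               bias: Sequence[float]) -> List[float]:
    """Sparse-autoencoder encoder step: ReLU(W h + b), one latent per row of W."""
    return [
        max(0.0, sum(w * h for w, h in zip(row, hidden)) + b)
        for row, b in zip(weights, bias)
    ]


def stream_guard(hidden_states: Iterable[Sequence[float]],
                 weights: Sequence[Sequence[float]],
                 bias: Sequence[float],
                 risk_features: Sequence[int],
                 threshold: float) -> Tuple[bool, int]:
    """Scan per-token hidden states and stop at the first risky token.

    Returns (flagged, token_index); token_index is -1 if nothing was flagged.
    risk_features and threshold are hypothetical knobs, not values from the paper.
    """
    for t, hidden in enumerate(hidden_states):
        latents = sae_encode(hidden, weights, bias)
        # Assumed scoring rule: strongest activation among the monitored features.
        score = max(latents[i] for i in risk_features)
        if score > threshold:
            return True, t
    return False, -1


# Toy example: 2-dimensional hidden states, 3 SAE latents (all values made up).
W = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
b = [0.0, 0.0, -0.5]
states = [[0.1, 0.2], [0.3, 0.1], [2.0, 0.2], [0.0, 0.0]]

flagged, cut_at = stream_guard(states, W, b, risk_features=[2], threshold=1.0)
print(flagged, cut_at)
```

In this toy run the third hidden state pushes the monitored latent above the threshold, so the stream is cut at that token and the fourth state is never examined. In a real deployment the hidden states would come from the safeguard model and the weights from a published SAE checkpoint.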
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis
