ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models
Sharanya Dasgupta, Arkaprabha Basu, Sujoy Nath, and Swagatam Das

TL;DR
ARREST is a framework that uses an external network to regulate a large language model's latent activations, improving safety and factual accuracy without fine-tuning the model itself.
Contribution
It introduces an external regulation method that corrects representational drift in an LLM's latent activation space, improving safety and truthfulness without fine-tuning the model's parameters.
Findings
ARREST effectively corrects factual and safety misalignments.
It outperforms RLHF models in generating soft refusals.
The framework is versatile and adaptable to adversarial training scenarios.
Abstract
Human cognition, driven by complex neurochemical processes, oscillates between imagination and reality and learns to self-correct whenever such subtle drifts lead to hallucinations or unsafe associations. In recent years, LLMs have demonstrated remarkable performance in a wide range of tasks. However, they still lack human cognition to balance factuality and safety. Bearing the resemblance, we argue that both factual and safety failures in LLMs arise from a representational misalignment in their latent activation space, rather than addressing those as entirely separate alignment issues. We hypothesize that an external network, trained to understand the fluctuations, can selectively intervene in the model to regulate falsehood into truthfulness and unsafe output into safe output without fine-tuning the model parameters themselves. Reflecting the hypothesis, we propose ARREST (Adversarial…
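The core idea in the abstract, an external component that intervenes on a frozen model's latent activations rather than updating its weights, can be illustrated with a toy sketch. This is not the ARREST method itself: the linear "layer", the `regulator` function, the steering `direction`, and the `strength` value are all hypothetical stand-ins chosen for illustration, and the real framework trains its external network, which this sketch does not attempt.

```python
import copy

def frozen_layer(weights, x):
    """Toy frozen linear layer computing y = W x. The weights are only read."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def regulator(hidden, direction, strength):
    """Hypothetical external regulator: shift the hidden state along a direction.

    In a real system the correction would come from a trained network; here it
    is a fixed vector, purely for illustration.
    """
    return [h + strength * d for h, d in zip(hidden, direction)]

def forward(weights, x, direction=None, strength=0.0):
    """Run the frozen layer, optionally applying the external correction."""
    hidden = frozen_layer(weights, x)
    if direction is not None:
        hidden = regulator(hidden, direction, strength)
    return hidden

# Assumed toy values: a 2x2 identity layer and a 2-dimensional input.
W = [[1.0, 0.0], [0.0, 1.0]]
W_before = copy.deepcopy(W)
x = [2.0, -1.0]

plain = forward(W, x)                                        # no intervention
steered = forward(W, x, direction=[0.0, 1.0], strength=3.0)  # with intervention

print(plain)          # [2.0, -1.0]
print(steered)        # [2.0, 2.0]
print(W == W_before)  # True: base parameters are untouched
```

The point of the sketch is the separation of concerns: the base layer's parameters stay fixed, and only the intermediate activation is adjusted by a component that lives outside the model.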
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis
