SafeTune: Search-based Harmfulness Minimisation for Large Language Models
Giordano d'Aloisio, David Williams, Giusy Annunziata, Zhiwei Fei, Antinisca Di Marco, Federica Sarro

TL;DR
This paper introduces SafeTune, a search-based method that reduces harmful responses from large language models while improving their relevance, by tuning hyperparameters and engineering the system prompt. Initial results are promising.
Contribution
SafeTune is a novel multi-objective search approach that minimises harmfulness and improves relevance in LLM responses through hyperparameter tuning and system prompt engineering.
Findings
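SafeTune significantly reduces the rate of harmful responses generated by Qwen3.5 0.8B, with a large effect size.
SafeTune increases prompt-response relevance, with a large effect size.
Among the explored parameters, encouraging greater repetition in responses has the largest impact on reducing harmfulness while increasing relevance.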
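The findings report "large" effect sizes, but this page does not say which effect-size measure was used. Purely as an illustration, assuming a measure common in search-based software engineering, the Vargha-Delaney Â12 statistic estimates the probability that a value drawn from one sample exceeds a value drawn from another (ties count half). With the conventional thresholds, Â12 ≥ 0.71 or ≤ 0.29 counts as large. The function names `a12` and `magnitude` are hypothetical, not taken from the paper.

```python
def a12(xs, ys):
    """Vargha-Delaney A12: P(x > y) + 0.5 * P(x == y) over all pairs.

    Illustrative only: the paper's actual effect-size measure is not stated here.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))


def magnitude(a):
    """Classify A12 using the conventional Vargha-Delaney thresholds.

    The scale is symmetric around 0.5, so 0.29 is as large as 0.71.
    """
    distance_from_chance = max(a, 1.0 - a)
    if distance_from_chance >= 0.71:
        return "large"
    if distance_from_chance >= 0.64:
        return "medium"
    if distance_from_chance >= 0.56:
        return "small"
    return "negligible"
```

For example, comparing per-run harmful-response rates of a baseline against a tuned configuration, an Â12 near 1.0 (or near 0.0, depending on argument order) would indicate that one configuration almost always yields the higher rate.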
Abstract
The widespread adoption of Large Language Models (LLMs) raises concerns about the potential harmfulness of their responses. In this paper, we first investigate the harmfulness of responses from four general-purpose LLMs. Next, we propose SafeTune, a multi-objective search-based approach to mitigate harmfulness while increasing response relevance through hyperparameter tuning and system prompt engineering. Our initial evaluation shows that SafeTune significantly reduces the rate of harmful responses generated by Qwen3.5 0.8B and increases prompt-response relevance (both with a large effect size). Among the parameters we explore, we also find that encouraging greater repetition in responses is most impactful in reducing harmfulness while increasing relevance.
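The abstract describes SafeTune only at a high level: a multi-objective search over hyperparameters and the system prompt that trades off harmfulness (to minimise) against relevance (to maximise). A minimal sketch of that general idea follows, assuming plain random search plus Pareto-front filtering. This is not SafeTune's algorithm. The search strategy, the parameter names (`temperature`, `top_p`, `repetition_penalty`), their ranges, the candidate system prompts and the toy `evaluate` objectives are all hypothetical stand-ins.

```python
import random
from dataclasses import dataclass

# Hypothetical search space: parameters, ranges and prompts are illustrative
# assumptions, not the configuration used by SafeTune.
SYSTEM_PROMPTS = (
    "You are a helpful assistant.",
    "You are a helpful and harmless assistant. Refuse unsafe requests.",
    "Answer the user's question directly and stay on topic.",
)


@dataclass(frozen=True)
class Config:
    temperature: float
    top_p: float
    repetition_penalty: float
    system_prompt: str


def sample_config(rng: random.Random) -> Config:
    """Draw one random point from the hypothetical search space."""
    return Config(
        temperature=rng.uniform(0.0, 1.5),
        top_p=rng.uniform(0.5, 1.0),
        repetition_penalty=rng.uniform(0.8, 1.5),
        system_prompt=rng.choice(SYSTEM_PROMPTS),
    )


def evaluate(cfg: Config) -> tuple[float, float]:
    """Toy stand-in for the real objectives, returning (harm_rate, relevance).

    A real evaluation would run the LLM on a benchmark of prompts with this
    configuration and score the responses with a harmfulness judge and a
    relevance metric. These formulas are invented purely for illustration.
    """
    harm = 0.4 * cfg.temperature / 1.5 + 0.3 * (cfg.repetition_penalty - 0.8) / 0.7
    if "harmless" not in cfg.system_prompt:
        harm += 0.2
    relevance = 1.0 - abs(cfg.top_p - 0.9) - 0.2 * cfg.temperature / 1.5
    return harm, relevance


def dominates(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if a Pareto-dominates b (harm minimised, relevance maximised)."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better


def pareto_front(scored):
    """Keep the (config, objectives) pairs not dominated by any other pair."""
    return [
        (cfg, obj)
        for cfg, obj in scored
        if not any(dominates(other, obj) for _, other in scored)
    ]


def random_search(budget: int = 200, seed: int = 0):
    """Evaluate `budget` random configurations; return all scores and the front."""
    rng = random.Random(seed)
    configs = [sample_config(rng) for _ in range(budget)]
    scored = [(cfg, evaluate(cfg)) for cfg in configs]
    return scored, pareto_front(scored)


if __name__ == "__main__":
    _, front = random_search()
    for cfg, (harm, rel) in sorted(front, key=lambda pair: pair[1][0]):
        print(f"harm={harm:.3f} relevance={rel:.3f} T={cfg.temperature:.2f} "
              f"top_p={cfg.top_p:.2f} rep_pen={cfg.repetition_penalty:.2f}")
```

Returning a Pareto front rather than a single best configuration leaves the final harm/relevance trade-off to the user. A more sample-efficient search (for instance an evolutionary multi-objective algorithm) could replace `random_search` without changing `dominates` or `pareto_front`.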
