MoGU: A Framework for Enhancing Safety of Open-Sourced LLMs While Preserving Their Usability

Yanrui Du; Sendong Zhao; Danyang Zhao; Ming Ma; Yuhan Chen; Liangyu Huo; Qing Yang; Dongliang Xu; Bing Qin

arXiv:2405.14488·cs.CL·May 24, 2024·3 cites

TL;DR

The paper introduces MoGU, a framework that improves the safety of open-source LLMs by balancing safe and usable responses through dynamic routing, addressing the limitations of existing rejection-focused defense strategies.

Contribution

MoGU is a framework that transforms a base LLM into two variants, a safe LLM and a usable LLM, and uses a dynamic routing mechanism to balance their contributions, enhancing safety without sacrificing usability.

Findings

1. MoGU outperforms existing defense strategies in safety and usability.

2. The routing mechanism effectively balances safe and usable responses.

3. Safer versions of Llama2, Vicuna, Falcon, Dolphin, and Baichuan2 were released.

Abstract

Large Language Models (LLMs) are increasingly deployed in various applications. As their usage grows, concerns regarding their safety are rising, especially in maintaining harmless responses when faced with malicious instructions. Many defense strategies have been developed to enhance the safety of LLMs. However, our research finds that existing defense strategies lead LLMs to predominantly adopt a rejection-oriented stance, thereby diminishing the usability of their responses to benign instructions. To solve this problem, we introduce the MoGU framework, designed to enhance LLMs' safety while preserving their usability. Our MoGU framework transforms the base LLM into two variants: the usable LLM and the safe LLM, and further employs dynamic routing to balance their contribution. When encountering malicious instructions, the router will assign a higher weight to the safe LLM to ensure…
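The routing idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the router emits one score per variant, that the scores are normalized with a softmax so the two weights sum to 1, and that the variants' outputs are combined as a weighted sum of vectors. The function names `router_weights` and `blend` and the toy vectors are hypothetical.

```python
import math


def router_weights(score_safe, score_usable):
    """Turn two router scores into weights that sum to 1 (assumed softmax)."""
    m = max(score_safe, score_usable)  # subtract the max for numerical stability
    e_safe = math.exp(score_safe - m)
    e_usable = math.exp(score_usable - m)
    total = e_safe + e_usable
    return e_safe / total, e_usable / total


def blend(h_safe, h_usable, w_safe, w_usable):
    """Weighted sum of the two variants' output vectors (assumed combination rule)."""
    return [w_safe * s + w_usable * u for s, u in zip(h_safe, h_usable)]


# Toy example: a router score that favors the safe variant,
# as the abstract describes for malicious instructions.
w_safe, w_usable = router_weights(2.0, 0.0)
mixed = blend([1.0, 2.0], [3.0, 4.0], w_safe, w_usable)
```

With a higher score for the safe variant, `w_safe` exceeds `w_usable`, so the blended vector lies closer to the safe variant's output.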

Code & Models

Repositories

dyr1/mogu (PyTorch, official)

Taxonomy

Topics: Digital Rights Management and Security · Semantic Web and Ontologies · Scientific Computing and Data Management

Methods: Balanced Selection