DIESEL -- Dynamic Inference-Guidance via Evasion of Semantic Embeddings in LLMs

Ben Ganon; Alon Zolfi; Omer Hofman; Inderjeet Singh; Hisashi Kojima; Yuval Elovici; Asaf Shabtai

arXiv:2411.19038·cs.CL·March 11, 2025·2 cites

TL;DR

DIESEL is a lightweight, flexible inference-guidance method for LLMs that filters unsafe responses by reranking tokens based on semantic similarity to negative concepts, improving safety without high computational costs.
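
The reranking idea in the summary above can be illustrated with a minimal sketch. It is not the authors' implementation: the scoring formula (a linear mix of token probability and one minus the maximum cosine similarity to any negative concept), the `alpha` weight, the function names, and the two-dimensional toy embeddings are all assumptions made for illustration. A real system would use the LLM's candidate tokens and a learned embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rerank_tokens(candidates, embed, negative_vecs, alpha=0.5):
    """Rerank (token, probability) pairs away from negative concepts.

    Assumed scoring rule (hypothetical, not taken from the paper):
        score = (1 - alpha) * probability + alpha * safety
        safety = 1 - max cosine similarity to any negative concept
    """
    scored = []
    for token, prob in candidates:
        vec = embed(token)
        max_sim = max((cosine(vec, n) for n in negative_vecs), default=0.0)
        safety = 1.0 - max(max_sim, 0.0)  # clamp negative similarity to 0
        scored.append((token, (1.0 - alpha) * prob + alpha * safety))
    # Highest combined score first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embedding table standing in for a real embedding model (hypothetical).
TOY_EMBEDDINGS = {
    "weapon": [1.0, 0.0],
    "garden": [0.0, 1.0],
    "tool": [0.6, 0.8],
}


def toy_embed(token):
    return TOY_EMBEDDINGS[token]


# Candidate next tokens with the probabilities a model might propose.
candidates = [("weapon", 0.5), ("tool", 0.3), ("garden", 0.2)]
negative_vecs = [[1.0, 0.0]]  # embedding of one negative concept

ranked = rerank_tokens(candidates, toy_embed, negative_vecs, alpha=0.5)
for token, score in ranked:
    print(f"{token}: {score:.2f}")
```

In this toy setup the most probable token, "weapon", coincides with the negative concept and drops to last place, while "garden" moves to the top. Setting `alpha=0` reduces the sketch to the original probability ordering, which makes the trade-off between fluency and filtering explicit.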

Contribution

DIESEL introduces a novel, efficient approach to semantically filter LLM responses, enhancing safety and generalization with minimal computational overhead.

Findings

01. Effective in filtering unsafe responses in conversational models

02. Performs well even in adversarial jailbreaking scenarios

03. Generalizes to non-safety-related response filtering

Abstract

In recent years, large language models (LLMs) have had great success in tasks such as casual conversation, contributing to significant advancements in domains like virtual assistance. However, they often generate responses that are not aligned with human values (e.g., ethical standards, safety), leading to potentially unsafe or inappropriate outputs. While several techniques have been proposed to address this problem, they come with a cost, requiring computationally expensive training or dramatically increasing the inference time. In this paper, we present DIESEL, a lightweight inference-guidance technique that can be seamlessly integrated into any autoregressive LLM to semantically filter undesired concepts from the response. DIESEL can function either as a standalone safeguard or as an additional layer of defense, enhancing response safety by reranking the LLM's proposed tokens based…

Taxonomy

Topics: Semantic Web and Ontologies · Natural Language Processing Techniques

Methods: LLaMA