Control Barrier Function for Aligning Large Language Models
Yuya Miyaoka, Masaki Inoue

TL;DR
This paper introduces a control barrier function-based safety filter for aligning large language models, enabling safe text generation without modifying the original models.
Contribution
It presents a novel, add-on safety filter framework using control barrier functions to improve LLM alignment without fine-tuning.
Findings
The safety filter can be applied without fine-tuning the baseline LLM.
An existing evaluation model for the desired alignment can be applied directly to the filter design.
The framework is implemented with open-source language models.
Abstract
This paper proposes a control-based framework for aligning large language models (LLMs) by leveraging a control barrier function (CBF) to ensure user-desirable text generation. The presented framework applies the CBF safety filter to the token predicted by the baseline LLM in order to intervene in the generated text. The safety filter has two significant advantages. First, it is an add-on, so it can be used for alignment without fine-tuning the baseline LLM. Second, if an evaluation model for the desired alignment exists, it can be applied directly to the filter design. The overall text-generation system is implemented with open-source language models, aiming to generate positive text.
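The filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the standard discrete-time CBF condition h(x_next) >= (1 - alpha) * h(x_now), where h is an evaluation model that scores the text so far (non-negative meaning "desirable"). The names `cbf_filter`, `toy_h`, `alpha`, and `candidates` are hypothetical, and `toy_h` is a word-counting stand-in for a real evaluation model.

```python
from typing import Callable, List, Tuple


def cbf_filter(
    text: str,
    candidates: List[Tuple[str, float]],
    h: Callable[[str], float],
    alpha: float = 0.5,
) -> str:
    """Pick the most probable candidate token that keeps the text 'safe'.

    Assumption: safety is checked with the standard discrete-time CBF
    condition h(text + token) >= (1 - alpha) * h(text), with 0 < alpha <= 1.
    """
    threshold = (1.0 - alpha) * h(text)
    # Try candidates in order of baseline-LLM probability, highest first.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    for token, _prob in ranked:
        if h(text + token) >= threshold:
            return token
    # No candidate satisfies the condition: fall back to the least-bad one.
    return max(ranked, key=lambda c: h(text + c[0]))[0]


def toy_h(text: str) -> float:
    """Hypothetical evaluator: positive-word count minus negative-word count."""
    positive = {"good", "great", "nice"}
    negative = {"bad", "awful"}
    words = text.lower().split()
    return float(sum(w in positive for w in words) - sum(w in negative for w in words))


if __name__ == "__main__":
    # Hypothetical (token, probability) pairs from a baseline LLM.
    proposals = [(" awful", 0.6), (" nice", 0.3), (" long", 0.1)]
    print(cbf_filter("the day was good and", proposals, toy_h))
```

In this sketch the baseline model's proposals are left untouched and the filter only chooses among them, which mirrors the add-on property: swapping `toy_h` for a trained sentiment or safety classifier changes the alignment target without retraining the generator.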
