Conditional Language Policy: A General Framework for Steerable Multi-Objective Finetuning

Kaiwen Wang; Rahul Kidambi; Ryan Sullivan; Alekh Agarwal; Christoph Dann; Andrea Michi; Marco Gelmi; Yunxuan Li; Raghav Gupta; Avinava Dubey; Alexandre Ramé; Johan Ferret; Geoffrey Cideron; Le Hou; Hongkun Yu; Amr Ahmed; Aranyak Mehta; Léonard Hussenot; Olivier Bachem; Edouard Leurent

arXiv:2407.15762·cs.LG·October 24, 2024

TL;DR

This paper introduces Conditional Language Policy (CLP), a flexible framework for finetuning language models to balance multiple conflicting objectives efficiently without needing multiple models.

Contribution

The paper proposes CLP, a novel method that enables steerable multi-objective finetuning of language models, outperforming existing approaches in Pareto efficiency.

Findings

01. CLP effectively trades off conflicting objectives at inference.

02. CLP outperforms existing multi-objective finetuning methods.

03. CLP does not require multiple models for different trade-offs.

Abstract

Reward-based finetuning is crucial for aligning language policies with intended behaviors (e.g., creativity and safety). A key challenge is to develop steerable language models that trade off multiple (conflicting) objectives in a flexible and efficient manner. This paper presents Conditional Language Policy (CLP), a general framework for finetuning language models on multiple objectives. Building on techniques from multi-task training and parameter-efficient finetuning, CLP learns steerable models that effectively trade off conflicting objectives at inference time. Notably, this does not require training or maintaining multiple models to achieve different trade-offs between the objectives. Through extensive experiments and ablations on two summarization datasets, we show that CLP learns steerable language models that outperform and Pareto-dominate the existing approaches for…
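As a rough illustration of what "trading off objectives at inference time" can mean, the toy sketch below uses linear scalarization, a standard multi-objective technique: a weight vector on the simplex combines per-objective rewards into one score, and changing the weights changes which output is preferred. This is not the CLP training procedure, which the abstract does not spell out. The function names (`sample_weights`, `scalarize`, `pick_response`), the candidate texts, and the reward values are all hypothetical and chosen only for illustration.

```python
import random


def sample_weights(num_objectives, rng):
    """Draw a random weight vector on the probability simplex.

    Normalised exponential draws are uniformly distributed on the simplex.
    The choice of sampling distribution is an assumption of this sketch.
    """
    draws = [rng.expovariate(1.0) for _ in range(num_objectives)]
    total = sum(draws)
    return [d / total for d in draws]


def scalarize(rewards, weights):
    """Linear scalarization: weighted sum of per-objective rewards."""
    if len(rewards) != len(weights):
        raise ValueError("rewards and weights must have the same length")
    return sum(w * r for w, r in zip(weights, rewards))


def pick_response(candidates, weights):
    """Return the candidate with the highest weighted reward."""
    return max(candidates, key=lambda c: scalarize(c["rewards"], weights))


# Hypothetical candidates scored on two objectives (made-up numbers).
candidates = [
    {"text": "short summary", "rewards": [0.9, 0.2]},
    {"text": "detailed summary", "rewards": [0.3, 0.8]},
]

if __name__ == "__main__":
    rng = random.Random(0)
    weights = sample_weights(2, rng)
    print(weights, pick_response(candidates, weights)["text"])
```

With weights `[1.0, 0.0]` the first candidate is selected and with `[0.0, 1.0]` the second, so a single scoring rule covers the whole range of trade-offs. A steerable model aims for the same effect by conditioning one set of parameters on the weights instead of keeping one model per trade-off.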

Taxonomy

Topics: Linguistic research and analysis · Syntax, Semantics, Linguistic Variation · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training