Quark: Controllable Text Generation with Reinforced Unlearning
Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, Yejin Choi

TL;DR
Quark is a novel method for fine-tuning language models to unlearn undesirable behaviors like toxicity and repetition by using reward-based conditioning, outperforming existing reinforcement learning approaches.
Contribution
Introduces Quantized Reward Konditioning (Quark), a new algorithm for controlled unlearning in language models that leverages reward quantiles and standard language modeling techniques.
Findings
Quark effectively reduces toxicity, negative sentiment, and repetition in generated text.
Outperforms PPO and other baselines in unlearning undesirable behaviors.
Relies solely on standard language modeling primitives, simplifying implementation.
Abstract
Large-scale language models often learn behaviors that are misaligned with user expectations. Generated text may contain offensive or toxic language, contain significant repetition, or be of a different sentiment than desired by the user. We consider the task of unlearning these misalignments by fine-tuning the language model on signals of what not to do. We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property, while not straying too far from the original model. Quark alternates between (i) collecting samples with the current language model, (ii) sorting them into quantiles based on reward, with each quantile identified by a reward token prepended to the language model's input, and (iii) using a standard language modeling loss on samples from each quantile conditioned on its reward token, while remaining…
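Step (ii) above, and the data preparation for step (iii), can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `<rw_k>` token format, and the equal-sized binning by rank are assumptions made for illustration, and the sampling and training loops are omitted.

```python
from typing import List


def quantize_by_reward(rewards: List[float], num_quantiles: int) -> List[int]:
    """Return a quantile index per sample: 0 = lowest reward, num_quantiles - 1 = highest.

    Assumption: quantiles are equal-sized bins over the reward ranking.
    """
    if num_quantiles < 1:
        raise ValueError("num_quantiles must be at least 1")
    n = len(rewards)
    # Indices of the samples, ordered from lowest to highest reward.
    order = sorted(range(n), key=lambda i: rewards[i])
    labels = [0] * n
    for rank, idx in enumerate(order):
        labels[idx] = rank * num_quantiles // n
    return labels


def reward_token(quantile: int) -> str:
    """Hypothetical reward-token format; the actual token strings are not given in the text."""
    return f"<rw_{quantile}>"


def build_training_texts(
    prompts: List[str],
    continuations: List[str],
    rewards: List[float],
    num_quantiles: int,
) -> List[str]:
    """Prepend each sample's reward token to its prompt and continuation."""
    if not (len(prompts) == len(continuations) == len(rewards)):
        raise ValueError("prompts, continuations and rewards must have equal length")
    labels = quantize_by_reward(rewards, num_quantiles)
    return [
        f"{reward_token(q)} {p}{c}"
        for q, p, c in zip(labels, prompts, continuations)
    ]
```

Under these assumptions, the resulting strings would be used with an ordinary language modeling loss, and at generation time the input would be conditioned on the highest-reward token, here `reward_token(num_quantiles - 1)`.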
Taxonomy
Topics: Topic Modeling
Methods: Entropy Regularization · Proximal Policy Optimization
