Language Model Decoding as Direct Metrics Optimization
Haozhe Ji, Pei Ke, Hongning Wang, Minlie Huang

TL;DR
This paper introduces a decoding method for language models that optimizes multiple metrics simultaneously, producing texts that align better with human preferences than those generated by existing decoding techniques.
Contribution
It formulates decoding as a metrics optimization problem with an analytical solution, guaranteeing improved perplexity and better alignment with human texts.
Findings
Outperforms baseline methods in metrics alignment
Achieves higher human evaluation scores
Demonstrates effectiveness across domains and model scales
Abstract
Despite the remarkable advances in language modeling, current mainstream decoding methods still struggle to generate texts that align with human texts across different aspects. In particular, sampling-based methods produce less-repetitive texts which are often disjunctive in discourse, while search-based methods maintain topic coherence at the cost of increased repetition. Overall, these methods fall short in achieving holistic alignment across a broad range of aspects. In this work, we frame decoding from a language model as an optimization problem with the goal of strictly matching the expected performance with human texts measured by multiple metrics of desired aspects simultaneously. The resulting decoding distribution enjoys an analytical solution that scales the input language model distribution via a sequence-level energy function defined by these metrics. And most importantly,…
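The reweighting idea in the abstract can be illustrated with a small sketch. It assumes the energy is linear in the metrics, E(x) = mu . f(x), so that the decoding distribution is proportional to p(x) * exp(-mu . f(x)), and it approximates that distribution by reweighting a handful of samples drawn from the language model. The function names, the toy data, and the simple update rule for mu are hypothetical choices for illustration, not the paper's algorithm.

```python
import math
import random


def energy_weights(feats, mu):
    """Normalized weights proportional to exp(-mu . f(x)) over LM samples."""
    neg_energy = [-sum(m * f for m, f in zip(mu, fx)) for fx in feats]
    top = max(neg_energy)  # subtract the max for numerical stability
    w = [math.exp(e - top) for e in neg_energy]
    total = sum(w)
    return [x / total for x in w]


def expected_metrics(feats, weights):
    """Weighted average of each metric under the reweighted distribution."""
    dim = len(feats[0])
    return [sum(w * fx[k] for w, fx in zip(weights, feats)) for k in range(dim)]


def fit_mu(feats, target, lr=1.0, steps=2000):
    """Adjust mu until the reweighted metric averages match the targets."""
    mu = [0.0] * len(target)
    for _ in range(steps):
        avg = expected_metrics(feats, energy_weights(feats, mu))
        # Raising mu[k] down-weights candidates with a high k-th metric,
        # so push mu[k] up whenever the current average overshoots the target.
        mu = [m + lr * (a - t) for m, a, t in zip(mu, avg, target)]
    return mu


def resample(candidates, weights, rng=random):
    """Draw one candidate according to the reweighted distribution."""
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical toy data: three LM samples with one metric each
# (for example, a repetition rate). Values are made up for illustration.
candidates = ["text a", "text b", "text c"]
feats = [[0.1], [0.5], [0.9]]
target = [0.3]  # assumed human-text average of the metric

mu = fit_mu(feats, target)
weights = energy_weights(feats, mu)
print(mu, weights, resample(candidates, weights))
```

In this toy setting the target must lie within the range of metric values observed among the candidates, otherwise no finite mu can match it. With real models, the candidates would be full sequences sampled from the language model and the metrics would be computed per sequence.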
Peer Reviews
Decision: ICLR 2024 poster
- The paper is well written and straightforward to follow.
- The motivation is sound, the authors provide an approach for estimating $\mu$, and for sampling 'efficiently' (although the time complexity of this approach is not given).
- Strong experimental results:
  - over a range of datasets
  - against quite a few decoding approaches commonly used in practice
  - ablation studies
  - a range of metrics (Repetition, Coherence, Diversity, Information Content)
- The problem is of great import
- Please provide complexity analysis of the sampling approach and compare to competing approaches (Greedy, Top-k, CD, CS, etc). While the experimental results are strong it is important to compare the runtime of this method to determine practical efficacy given the authors claim it is "efficient" in the conclusion.
- Analysis on the convergence of $\mu$ in Algorithm 1 and some sensitivity analysis to initialization would be helpful for practitioners.
- Given that different metrics perform stron
* The method is novel and provides a nice lightweight alternative to fine-tuning methods. It may thus be widely accessible to practitioners without the ability to tune larger language models
* The empirical component of the paper is comprehensive, including nice ablation studies
* The mathematical motivations given by this paper are quite weakly supported/explained. In general, the language used by the authors with respect to this topic is confusing and informal. For example, they motivate their use of the reverse KL for choosing the parameters of q by stating “KL(q || pθ) restricts the decoding distribution q to deviate minimally from the LM distribution pθ”, but the same argument could be made for the forward variant of the divergence. Similar language is scattered ac
1. The proposed framework makes sense and is technically sound. Based on their constructed framework, they propose a reasonable approximation that can work in practice.
2. Experiments demonstrate good empirical results compared to many other decoding algorithms.
3. Both automatic and human evaluations are conducted to provide insights into their method.
1. Related works such as [1] are missing. It is worth discussing and comparing with these methods in the paper.
2. The paper only conducts experiments in two open-ended text generation tasks, where both automatic and human evaluations are hard. On the other hand, evaluations on other text generation tasks such as machine translation and text summarization are more accurate and can better indicate if their method is indeed effective.
3. Their method can be more computationally expensive than baselines. [
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: ALIGN
