Language Model Cascades: Token-level uncertainty and beyond
Neha Gupta, Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar

TL;DR
This paper studies how to use language models cost-effectively through cascading: a small model handles most inputs and defers the hard ones to a larger model. It focuses on token-level uncertainty measures for deciding when to defer in generative NLP tasks.
Contribution
It introduces token-level uncertainty-based deferral rules for LM cascades, addressing length bias issues and enhancing cost-quality tradeoffs with learned strategies and model embeddings.
Findings
Token-level uncertainty improves deferral decisions.
Learned deferral rules outperform simple aggregation.
Embedding information boosts cascade performance.
Abstract
Recent advances in language models (LMs) have led to significant improvements in quality on complex NLP tasks, but at the expense of increased inference costs. Cascading offers a simple strategy to achieve more favorable cost-quality tradeoffs: here, a small model is invoked for most "easy" instances, while a few "hard" instances are deferred to the large model. While the principles underpinning cascading are well-studied for classification tasks - with deferral based on predicted class uncertainty favored theoretically and practically - a similar understanding is lacking for generative LM tasks. In this work, we initiate a systematic study of deferral rules for LM cascades. We begin by examining the natural extension of predicted class uncertainty to generative LM tasks, namely, the predicted sequence uncertainty. We show that this measure suffers from the length bias problem, either…
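The length bias mentioned in the abstract can be illustrated with a minimal sketch. It assumes the small model exposes per-token probabilities for its generated output. The function names, the threshold value, and the toy probabilities below are hypothetical and are not taken from the paper.

```python
import math

def sequence_log_prob(token_probs):
    """Sum of per-token log-probabilities: a sequence-level confidence score."""
    return sum(math.log(p) for p in token_probs)

def mean_log_prob(token_probs):
    """Length-normalized variant: average per-token log-probability."""
    return sequence_log_prob(token_probs) / len(token_probs)

def cascade(small_output, token_probs, call_large_model, threshold,
            score=sequence_log_prob):
    """Keep the small model's output unless its confidence score is below
    the threshold, in which case defer to the large model.
    `call_large_model` is a hypothetical zero-argument callable."""
    if score(token_probs) < threshold:
        return call_large_model()
    return small_output

# Toy data (assumed): every token has the same probability, only length differs.
short_probs = [0.9] * 5
long_probs = [0.9] * 50

# The summed score drops as the output gets longer, so with a fixed threshold
# the long output is deferred even though each token is equally confident.
kept = cascade("small", short_probs, lambda: "large", threshold=-2.0)
deferred = cascade("small", long_probs, lambda: "large", threshold=-2.0)

# The length-normalized score treats both outputs identically.
normalized = cascade("small", long_probs, lambda: "large", threshold=-2.0,
                     score=mean_log_prob)
```

Here `kept` and `normalized` hold the small model's output and `deferred` holds the large model's. Normalizing by length removes the bias in this toy case, but it is only one possible aggregation of the token-level scores.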
Peer Reviews
Decision: ICLR 2024 poster
This paper presents a simple approach that seems to be very effective. The connection to rejection in classifiers is intuitive but had not occurred to me. The paper is easy to follow. The clarity of presentation convinces me that it would be easy for me to try the approach myself, either for its own utility or as a replication study. I admire the accessibility of this work.
Although the paper as a whole is very clear, there are places where the experiments lack specifics (see the questions section). Additionally, there are places where the experimental set-up seems to be suboptimal: greedy decoding was used, but this is more prone to hallucination than beam search. Some of the conclusions of the experiments might be artifacts of these hallucinations. I can think specifically of these two: 1) there was a negative correlation between translation quality and sequenc
- A principled method for uncertainty estimation in LLMs (i.e. FLAN-T5).
- Clear description of the background knowledge and related work needed to understand the proposed method.
- The authors perform a comprehensive evaluation of the proposed method across different NLP tasks.
- The lack of comparison with other uncertainty estimation methods is not motivated.
- A possible extra contribution would be applying or discussing the method for NLP tasks under out-of-distribution (OOD) or domain-adaptation settings.
1. It demonstrates that simple sequence-level LM confidence measures for deferral can lead to sub-optimal cost-quality tradeoffs due to length bias.
2. The paper proposes a simple yet effective method that uses quantiles of the token log-likelihoods to design a deferral rule.
3. It proposes a post-hoc deferral rule trained on quantile features and the input embeddings of both the small LM and the large LM. The extensive experiments on FLAN-T5 verify the efficacy of the method.
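The quantile idea raised in these points can be sketched as follows. This is an illustration only: the quantile levels, the threshold, and the helper names are assumptions, and the learned post-hoc rule (a model trained on such features plus embeddings) is not reproduced here.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac

def quantile_features(token_log_probs, levels=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Fixed-length summary of a variable-length sequence of token
    log-probabilities; the levels are an assumed choice."""
    return [quantile(token_log_probs, q) for q in levels]

def defer_by_quantile(token_log_probs, q, threshold):
    """Defer to the large model when the q-th quantile of the token
    log-probabilities falls below the threshold."""
    return quantile(token_log_probs, q) < threshold

# Toy output (assumed): four confident tokens and one very uncertain token.
log_probs = [-0.1, -0.1, -0.1, -0.1, -3.0]

features = quantile_features(log_probs)
mean_score = sum(log_probs) / len(log_probs)

# The minimum (q = 0) flags the uncertain token; the median (q = 0.5) ignores it.
defer_on_min = defer_by_quantile(log_probs, q=0.0, threshold=-1.0)
defer_on_median = defer_by_quantile(log_probs, q=0.5, threshold=-1.0)
```

The feature vector has the same length for every output, so it could feed a small learned deferral model. Which quantile works best is an empirical question that this sketch does not answer.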
1. Compared with simply averaging the log probabilities, the major advantage of the quantile is that it reflects more of the overall log-probability distribution of the sequence and is more robust to outliers. To motivate the proposal, more evidence that such outliers exist is needed; the example in Figure 1 is not sufficient.
2. In Figure 2 and Figure 3, it seems that the best generation performance is obtained in the middle of the curve, rather than the endpoint wh
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Flan-T5
