Task Calibration: Calibrating Large Language Models on Inference Tasks
Yingjie Li, Yun Luo, Xiaotian Xie, Yue Zhang

TL;DR
This paper introduces task calibration (TC), a method to improve large language models' reasoning on inference tasks by reducing reliance on spurious correlations, leading to better zero-shot and few-shot performance.
Contribution
The paper proposes task calibration, a novel calibration approach inspired by mutual information that encourages LLMs to reason over both the premise and the hypothesis instead of over-relying on either one alone.
Findings
TC significantly improves zero-shot performance on 13 inference tasks.
TC is also effective in few-shot settings and on diverse natural language understanding tasks.
TC is robust to different prompt templates and compatible with other calibration methods.
Abstract
Large language models (LLMs) have exhibited impressive zero-shot performance on inference tasks. However, LLMs may suffer from spurious correlations between input texts and output labels, which limits LLMs' ability to reason based purely on general language understanding. In other words, LLMs may make predictions primarily based on premise or hypothesis, rather than both components. To address this problem that may lead to unexpected performance degradation, we propose task calibration (TC), a zero-shot and inference-only calibration method inspired by mutual information which recovers LLM performance through task reformulation. TC encourages LLMs to reason based on both premise and hypothesis, while mitigating the models' over-reliance on individual premise or hypothesis for inference. Experimental results show that TC achieves a substantial improvement on 13 inference tasks in the…
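The abstract does not state the scoring rule, so the following is only a minimal sketch of one mutual-information-style reformulation, under loudly stated assumptions: the model is queried three times (premise and hypothesis together, premise only, hypothesis only), and a label's score is its joint log-probability counted twice minus its two single-input log-probabilities. The function name `task_calibrated_label`, the `eps` smoothing term, and the example probabilities are all hypothetical and are not taken from the paper.

```python
import math


def task_calibrated_label(p_full, p_premise, p_hypothesis, eps=1e-12):
    """Return (best_label, scores) under an assumed MI-style score.

    Assumed score (not confirmed by the source):
        2 * log p(y | premise, hypothesis)
          - log p(y | premise) - log p(y | hypothesis)
    Each argument is a dict mapping label -> probability.
    """
    scores = {}
    for label, p in p_full.items():
        scores[label] = (
            2.0 * math.log(p + eps)
            - math.log(p_premise[label] + eps)
            - math.log(p_hypothesis[label] + eps)
        )
    best = max(scores, key=scores.get)
    return best, scores


# Made-up probabilities: the hypothesis alone already pushes the model
# toward "entailment", which mimics a spurious correlation.
p_full = {"entailment": 0.40, "neutral": 0.35, "contradiction": 0.25}
p_premise = {"entailment": 0.50, "neutral": 0.30, "contradiction": 0.20}
p_hypothesis = {"entailment": 0.70, "neutral": 0.20, "contradiction": 0.10}

uncalibrated = max(p_full, key=p_full.get)
calibrated, scores = task_calibrated_label(p_full, p_premise, p_hypothesis)
print("uncalibrated:", uncalibrated)
print("calibrated:", calibrated)
```

In this toy example the label that the single-input queries already favor is penalized, so the calibrated choice can differ from the plain argmax over the joint prediction. Whether the paper uses this exact weighting is an assumption.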
Peer Reviews
Decision·Submitted to ICLR 2025
The approach is simple and, if the results hold, might be a useful method to calibrate LLMs for NLI-based reasoning tasks.
The paper has several flaws: For motivation, the paper cites papers such as Gururangan et al. (2018), which study biases in NLI models, and papers such as McKenna et al. (2023), which study a different bias in LLMs for NLI tasks. While the former work is done on models fine-tuned for NLI, the latter shows evidence for specific biases in terms of memorization and term frequency. This is a misleading equivalence in the introduction section. This paper would have benefitted from analyzing the biases
1. The paper has enough novelty. Although mutual information is not new, applying it to the inference score function can be considered novel.
2. It includes all the previously related works and lists the differences.
3. The paper writing is clear, and the visuals are good.
4. It has detailed experiments and results analysis.
N/A
- The paper proposes a new calibration method for natural language inference via generative language models, which has been shown to be promising by experiments.
- The method is experimented on comprehensive datasets and models, which makes the conclusion solid.
- While the authors claim the discovery of premise-side spurious correlation as an important contribution, many previous works have studied the hypothesis-side spurious correlation (also as cited). There is no significant difference between the roles of premise and hypothesis in natural language inference, which makes the contribution of this discovery incremental.
- The studied paradigm is a bit too narrow, which improves a method of solving a specific task (natural language inference
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
