Task Calibration: Calibrating Large Language Models on Inference Tasks

Yingjie Li; Yun Luo; Xiaotian Xie; Yue Zhang

arXiv:2410.18764·cs.CL·October 25, 2024

Open Access · 3 Reviews

TL;DR

This paper introduces task calibration (TC), a method to improve large language models' reasoning on inference tasks by reducing reliance on spurious correlations, leading to better zero-shot and few-shot performance.

Contribution

The paper proposes task calibration, a novel approach inspired by mutual information that encourages LLMs to base their predictions on both the premise and the hypothesis rather than on either one alone.

Findings

01

TC significantly improves zero-shot performance on 13 inference tasks.

02

TC is also effective in few-shot settings and on diverse natural language understanding tasks.

03

TC is robust to different prompt templates and compatible with other calibration methods.

Abstract

Large language models (LLMs) have exhibited impressive zero-shot performance on inference tasks. However, LLMs may suffer from spurious correlations between input texts and output labels, which limits LLMs' ability to reason based purely on general language understanding. In other words, LLMs may make predictions primarily based on premise or hypothesis, rather than both components. To address this problem that may lead to unexpected performance degradation, we propose task calibration (TC), a zero-shot and inference-only calibration method inspired by mutual information which recovers LLM performance through task reformulation. TC encourages LLMs to reason based on both premise and hypothesis, while mitigating the models' over-reliance on individual premise or hypothesis for inference. Experimental results show that TC achieves a substantial improvement on 13 inference tasks in the…
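The abstract does not give the TC scoring formula, so the following is only a minimal sketch of the general idea, assuming a pointwise-mutual-information-style score: the label probability given both premise and hypothesis is discounted by the label probabilities the model assigns when it sees the premise alone or the hypothesis alone. The function name `task_calibrate`, the equal 0.5 weighting, and the example probabilities are all hypothetical and are not taken from the paper.

```python
import math


def task_calibrate(p_full, p_premise_only, p_hypothesis_only, eps=1e-12):
    """Illustrative PMI-style calibration (an assumption, not the paper's exact formula).

    Each argument is a list of label probabilities from the same LLM:
      p_full            -- given premise and hypothesis together
      p_premise_only    -- given the premise alone
      p_hypothesis_only -- given the hypothesis alone
    Returns a renormalized distribution over labels.
    """
    scores = []
    for pf, pp, ph in zip(p_full, p_premise_only, p_hypothesis_only):
        # Reward labels supported by the full input; penalize labels the model
        # already prefers from a single component (a likely spurious cue).
        score = math.log(pf + eps) - 0.5 * (math.log(pp + eps) + math.log(ph + eps))
        scores.append(score)

    # Numerically stable softmax to turn scores back into probabilities.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical two-label example: index 0 = "entailment", index 1 = "not entailment".
# The hypothesis alone already pushes the model strongly toward label 0.
p_full = [0.55, 0.45]
p_premise_only = [0.50, 0.50]
p_hypothesis_only = [0.80, 0.20]

calibrated = task_calibrate(p_full, p_premise_only, p_hypothesis_only)
predicted_label = max(range(len(calibrated)), key=calibrated.__getitem__)
```

In this made-up example the uncalibrated prediction is label 0, while the calibrated score prefers label 1 because most of the support for label 0 comes from the hypothesis alone. When the single-component distributions are uniform, the sketch leaves the full-input distribution unchanged.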

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 5

Strengths

The approach is simple and, if the results hold, might be a useful method for calibrating LLMs on NLI-based reasoning tasks.

Weaknesses

The paper has several flaws. For motivation, the paper cites papers such as Gururangan et al. (2018), which study biases in NLI models, and papers such as McKenna et al. (2023), which study a different bias in LLMs for NLI tasks. While the former work concerns models fine-tuned for NLI, the latter shows evidence for specific biases in terms of memorization and term frequency. Treating the two as equivalent in the introduction section is misleading. This paper would have benefitted from analyzing the biases

Reviewer 02 · Rating 10 · Confidence 3

Strengths

1. The paper has enough novelty. Although mutual information is not new, applying it to the inference score function can be considered novel.
2. It covers the related prior work and lists the differences.
3. The writing is clear, and the visuals are good.
4. It has detailed experiments and analysis of the results.

Weaknesses

N/A

Reviewer 03 · Rating 5 · Confidence 5

Strengths

- The paper proposes a new calibration method for natural language inference via generative language models, which has been shown to be promising by experiments.
- The method is experimented on comprehensive datasets and models, which makes the conclusion solid.

Weaknesses

- While the authors claim the discovery of premise-side spurious correlation as an important contribution, many previous works have studied hypothesis-side spurious correlation (as also cited). There is no significant difference between the roles of premise and hypothesis in natural language inference, which makes the contribution of this discovery incremental.
- The studied paradigm is a bit too narrow, which improves a method of solving a specific task (natural language inference

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques