Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?

Zhe Yang; Yichang Zhang; Tianyu Liu; Jian Yang; Junyang Lin; Chang Zhou; Zhifang Sui

arXiv:2406.12809·cs.CL·June 19, 2024

TL;DR

This paper investigates the inconsistency of large language models in solving problems of varying difficulty, introduces a benchmark and metric for measuring this inconsistency, and analyzes factors affecting model consistency.

Contribution

The paper develops the ConsisEval benchmark and consistency score to evaluate and analyze LLM inconsistency across easy and hard problems.

Findings

01. GPT-4 achieves the highest consistency score, 92.2%, but still fails in specific cases.

02. Stronger models generally show higher consistency, with some exceptions.

03. Hard data improves model consistency in both fine-tuning and in-context learning.

Abstract

Large language models (LLMs) have demonstrated impressive capabilities, but still suffer from inconsistency issues (e.g., LLMs can react differently to disturbances like rephrasing or inconsequential order change). In addition to these inconsistencies, we also observe that LLMs, while capable of solving hard problems, can paradoxically fail at easier ones. To evaluate this hard-to-easy inconsistency, we develop the ConsisEval benchmark, where each entry comprises a pair of questions with a strict order of difficulty. Furthermore, we introduce the concept of consistency score to quantitatively measure this inconsistency and analyze the potential for improvement in consistency by relative consistency score. Based on comprehensive experiments across a variety of existing models, we find: (1) GPT-4 achieves the highest consistency score of 92.2% but is still inconsistent to specific…
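The abstract above does not spell out how the consistency score is computed. As a rough illustration only, the sketch below assumes it is the share of question pairs whose easy question is answered correctly among the pairs whose hard question is answered correctly. The names `PairResult` and `consistency_score` are hypothetical, the data is invented, and using a single outcome per question is a simplification; this is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PairResult:
    """Outcome for one benchmark entry: an easy question paired with a strictly harder one."""
    easy_correct: bool
    hard_correct: bool


def consistency_score(results: Iterable[PairResult]) -> Optional[float]:
    """Assumed definition: fraction of pairs with a correct easy answer
    among the pairs with a correct hard answer.

    Returns None when no hard question was answered correctly,
    because the ratio is undefined in that case.
    """
    hard_solved = [r for r in results if r.hard_correct]
    if not hard_solved:
        return None
    return sum(r.easy_correct for r in hard_solved) / len(hard_solved)


# Toy outcomes, invented for illustration only (not results from the paper).
toy_results = [
    PairResult(easy_correct=True, hard_correct=True),
    PairResult(easy_correct=False, hard_correct=True),  # hard solved, easy missed
    PairResult(easy_correct=True, hard_correct=False),  # ignored: hard not solved
    PairResult(easy_correct=True, hard_correct=True),
]
score = consistency_score(toy_results)
print(score)
```

Under this assumed definition, pairs whose hard question is missed do not affect the score, so a weak model and a strong model can be compared on consistency separately from raw accuracy.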

Code & Models

Repositories

QwenLM/ConsisEval
Official

Videos

Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones? · Underline

Taxonomy

Topics: Topic Modeling

Methods: Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer