LLMs' Classification Performance is Overclaimed

Hanzi Xu; Renze Lou; Jiangshu Du; Vahid Mahzoon; Elmira Talebianaraki; Zhuoan Zhou; Elizabeth Garrison; Slobodan Vucetic; Wenpeng Yin

arXiv:2406.16203·cs.CL·July 4, 2024·1 citation

Open Access·1 Repo

TL;DR

This paper critically examines the overclaimed performance of Large Language Models in classification tasks by introducing a new testbed, benchmark, and evaluation metric that reveal their limitations when gold labels are absent.

Contribution

It introduces the Classify-w/o-Gold task, the Know-No benchmark, and the OmniAccuracy metric to better evaluate LLMs' true understanding in classification tasks.

Findings

01

LLMs struggle to correctly classify when gold labels are absent.

02

Performance overestimation occurs due to LLMs' reliance on label presence.

03

The new OmniAccuracy metric exposes limits in LLMs' understanding of classification tasks.

Abstract

In many classification tasks designed for AI or humans to solve, gold labels are typically included within the label space by default, often posed as "which of the following is correct?" This standard setup has traditionally highlighted the strong performance of advanced AI, particularly top-performing Large Language Models (LLMs), in routine classification tasks. However, when the gold label is intentionally excluded from the label space, it becomes evident that LLMs still attempt to select from the available label candidates, even when none are correct. This raises a pivotal question: Do LLMs truly demonstrate their intelligence in understanding the essence of classification tasks? In this study, we evaluate both closed-source and open-source LLMs across representative classification tasks, arguing that the perceived performance of LLMs is overstated due to their inability to exhibit…
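The setup the abstract describes, removing the gold label from the label space and checking whether the model still picks a candidate, can be illustrated with a minimal sketch. This is not the authors' code or the Know-No data format: the `Item` class, the helper names, and the substring-based check are all hypothetical, and the sketch does not implement OmniAccuracy, whose definition is not given here.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical container for one classification example."""
    question: str
    options: list[str]
    gold: str


def without_gold(item: Item) -> list[str]:
    """Return the label space with the gold label excluded."""
    return [opt for opt in item.options if opt != item.gold]


def build_prompt(question: str, options: list[str]) -> str:
    """Format a 'which of the following is correct?' prompt."""
    lines = [question, "Which of the following is correct?"]
    lines += [f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)


def picked_an_option(response: str, options: list[str]) -> bool:
    """Crude assumption: the model 'picked' if it names any candidate."""
    text = response.lower()
    return any(opt.lower() in text for opt in options)


def abstention_rate(responses: list[str], option_sets: list[list[str]]) -> float:
    """Fraction of responses that do not select any remaining candidate."""
    abstained = [
        not picked_an_option(resp, opts)
        for resp, opts in zip(responses, option_sets)
    ]
    return sum(abstained) / len(abstained)


item = Item(
    question="What is the capital of France?",
    options=["Paris", "Berlin", "Rome", "Madrid"],
    gold="Paris",
)
candidates = without_gold(item)
prompt = build_prompt(item.question, candidates)
```

In this sketch, a response that names `Berlin` would count as a wrong selection, while a response stating that none of the options is correct would count as an abstention. A real evaluation would need a more robust response parser than substring matching.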

Code & Models

Repositories

xhz0809/know-no
Official

Taxonomy

Topics: Biomedical Text Mining and Ontologies