LLMs' Classification Performance is Overclaimed
Hanzi Xu, Renze Lou, Jiangshu Du, Vahid Mahzoon, Elmira Talebianaraki, Zhuoan Zhou, Elizabeth Garrison, Slobodan Vucetic, Wenpeng Yin

TL;DR
This paper critically examines the overclaimed performance of Large Language Models in classification tasks by introducing a new testbed, benchmark, and evaluation metric that reveal their limitations when gold labels are absent.
Contribution
It introduces the Classify-w/o-Gold task, the Know-No benchmark, and the OmniAccuracy metric to better evaluate LLMs' true understanding in classification tasks.
Findings
LLMs struggle to correctly classify when gold labels are absent.
Performance overestimation occurs due to LLMs' reliance on label presence.
New evaluation metrics reveal limitations of LLMs in understanding tasks.
Abstract
In many classification tasks designed for AI or humans to solve, gold labels are typically included within the label space by default, often posed as "which of the following is correct?" This standard setup has traditionally highlighted the strong performance of advanced AI, particularly top-performing Large Language Models (LLMs), in routine classification tasks. However, when the gold label is intentionally excluded from the label space, it becomes evident that LLMs still attempt to select from the available label candidates, even when none are correct. This raises a pivotal question: Do LLMs truly demonstrate their intelligence in understanding the essence of classification tasks? In this study, we evaluate both closed-source and open-source LLMs across representative classification tasks, arguing that the perceived performance of LLMs is overstated due to their inability to exhibit…
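The setup described in the abstract, removing the gold label from the candidate list and checking whether the model still picks one of the remaining options, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Example` type, the `NONE_ANSWER` marker, the `remove_gold` and `accuracy_without_gold` helpers, and the two toy classifiers are hypothetical and do not come from the paper. The score computed is a simple abstention rate, not the paper's OmniAccuracy metric, whose definition is not given in this summary.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical marker for "no listed candidate is correct" (an assumption).
NONE_ANSWER = "none of the above"


@dataclass
class Example:
    text: str
    options: List[str]
    gold: str


def remove_gold(example: Example) -> List[str]:
    """Return the candidate labels with the gold label excluded."""
    return [o for o in example.options if o != example.gold]


def accuracy_without_gold(
    examples: List[Example],
    classify: Callable[[str, List[str]], str],
) -> float:
    """Fraction of examples where the classifier declines to pick a wrong label.

    With the gold label removed, every remaining candidate is incorrect, so
    this sketch counts a prediction as correct only if it equals NONE_ANSWER.
    """
    if not examples:
        return 0.0
    correct = 0
    for ex in examples:
        candidates = remove_gold(ex)
        if classify(ex.text, candidates) == NONE_ANSWER:
            correct += 1
    return correct / len(examples)


# Toy data and two stand-in "models" (both hypothetical).
examples = [
    Example("2 + 2 = ?", ["3", "4", "5"], "4"),
    Example("Capital of France?", ["Paris", "Rome"], "Paris"),
]


def pick_first(text: str, candidates: List[str]) -> str:
    # Always selects from the offered labels, mirroring the failure mode
    # the abstract describes.
    return candidates[0]


def abstain(text: str, candidates: List[str]) -> str:
    # Always reports that no candidate is correct.
    return NONE_ANSWER


print(accuracy_without_gold(examples, pick_first))
print(accuracy_without_gold(examples, abstain))
```

In practice `classify` would wrap a call to an LLM, and deciding whether a free-text reply counts as a refusal to choose would need more careful matching than the exact string comparison used here.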
Taxonomy
Topics: Biomedical Text Mining and Ontologies
