A Survey of Useful LLM Evaluation
Ji-Lun Peng, Sijia Cheng, Egil Diau, Yung-Yu Shih, Po-Heng Chen, Yen-Ting Lin, Yun-Nung Chen

TL;DR
This survey reviews the evaluation methods for large language models, proposing a two-stage framework from core abilities to agent applications, and discusses current challenges and future directions.
Contribution
It introduces a novel two-stage evaluation framework for LLMs, linking core capabilities to practical agent applications and analyzing assessment challenges.
Findings
Identifies key core abilities like reasoning and societal impact.
Defines agent capabilities such as embodied action and tool learning.
Highlights current challenges and future research directions in LLM evaluation.
Abstract
LLMs have attracted attention across various research domains due to their exceptional performance on a wide range of complex tasks. Refined methods for evaluating their capabilities are therefore needed to determine the tasks and responsibilities they should undertake. Our study mainly discusses how LLMs, as useful tools, should be effectively assessed. We propose a two-stage framework, from "core ability" to "agent", that explains how LLMs can be applied based on their specific capabilities, along with the evaluation methods at each stage. Core ability refers to the capabilities that LLMs need in order to generate high-quality natural language text. Once LLMs are confirmed to possess core ability, they can solve complex real-world tasks as agents. In the "core ability" stage, we discuss the reasoning ability, societal impact, and domain knowledge of LLMs. In the "agent"…
Taxonomy
Topics: Digital Rights Management and Security · Natural Language Processing Techniques · Advanced Computational Techniques and Applications
