Beyond Benchmark: LLMs Evaluation with an Anthropomorphic and Value-oriented Roadmap

Jun Wang; Ninglun Gu; Kailai Zhang; Zijiao Zhang; Yelun Bao; Jin Yang; Xu Yin; Liwei Liu; Yihuan Liu; Pengyong Li; Gary G. Yen; Junchi Yan

arXiv:2508.18646·cs.AI·November 19, 2025

TL;DR

This paper proposes a holistic, anthropomorphic evaluation framework for Large Language Models that assesses their general intelligence, emotional alignment, and professional expertise, addressing limitations of current benchmark-centric methods.

Contribution

It introduces a novel three-dimensional taxonomy and a value-oriented evaluation framework to better measure LLMs' real-world utility and ethical alignment.

Findings

01. Analysis of 200+ benchmarks highlights key evaluation challenges.

02. The proposed framework offers a comprehensive assessment of LLM capabilities.

03. A curated repository of open-source evaluation resources is provided.

Abstract

For Large Language Models (LLMs), a disconnect persists between benchmark performance and real-world utility. Current evaluation frameworks remain fragmented, prioritizing technical metrics while neglecting holistic assessment for deployment. This survey introduces an anthropomorphic evaluation paradigm through the lens of human intelligence, proposing a novel three-dimensional taxonomy: Intelligence Quotient (IQ)-General Intelligence for foundational capacity, Emotional Quotient (EQ)-Alignment Ability for value-based interactions, and Professional Quotient (PQ)-Professional Expertise for specialized proficiency. For practical value, we pioneer a Value-oriented Evaluation (VQ) framework assessing economic viability, social impact, ethical alignment, and environmental sustainability. Our modular architecture integrates six components with an implementation roadmap. Through analysis of…
