TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs

Shuyi Xie; Wenlin Yao; Yong Dai; Shaobo Wang; Donlin Zhou; Lifeng Jin; Xinhua Feng; Pengzhi Wei; Yujie Lin; Zhichao Hu; Dong Yu; Zhengyou Zhang; Jing Nie; Yuhong Liu

arXiv:2311.05374·cs.CL·November 10, 2023·1 cites



TL;DR

This paper introduces a comprehensive hierarchical human evaluation framework for assessing the alignment of large language models with human preferences across diverse real-world tasks, including detailed standards and a large test set.

Contribution

It presents a novel hierarchical evaluation methodology, a standardized process, and a large dataset for assessing LLMs' human alignment in multiple languages and domains.

Findings

01. Effective in benchmarking Tencent Hunyuan LLMs
02. Supports both English and Chinese evaluations
03. Automating parts of evaluation with GPT-4 is feasible

Abstract

Large language models (LLMs) have shown impressive capabilities across various natural language tasks. However, evaluating their alignment with human preferences remains a challenge. To this end, we propose a comprehensive human evaluation framework to assess LLMs' proficiency in following instructions on diverse real-world tasks. We construct a hierarchical task tree encompassing 7 major areas covering over 200 categories and over 800 tasks, which covers diverse capabilities such as question answering, reasoning, multiturn dialogue, and text generation, to evaluate LLMs in a comprehensive and in-depth manner. We also design detailed evaluation standards and processes to facilitate consistent, unbiased judgments from human evaluators. A test set of over 3,000 instances is released, spanning different difficulty levels and knowledge domains. Our work provides a standardized methodology…
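
The hierarchical task tree described in the abstract (major areas, then categories, then tasks) can be pictured with a small data structure. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the `TaskNode` class, the category and task names, the numeric rating scale, and the unweighted-mean roll-up are all hypothetical choices made for illustration. Only the capability names "reasoning" and "text generation" come from the abstract.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class TaskNode:
    """One node in a hypothetical area -> category -> task tree."""
    name: str
    children: Dict[str, "TaskNode"] = field(default_factory=dict)
    # Human ratings are stored on leaf (task) nodes only; the scale is an assumption.
    ratings: List[float] = field(default_factory=list)

    def add_path(self, path: List[str]) -> "TaskNode":
        """Create (or fetch) the node reached by following `path` from this node."""
        node = self
        for part in path:
            node = node.children.setdefault(part, TaskNode(part))
        return node

    def score(self) -> Optional[float]:
        """Mean rating at a leaf; unweighted mean of scored children otherwise.

        The unweighted mean is an illustrative assumption, not the paper's method.
        """
        if not self.children:
            return mean(self.ratings) if self.ratings else None
        child_scores = [
            s for s in (child.score() for child in self.children.values())
            if s is not None
        ]
        return mean(child_scores) if child_scores else None


# Hypothetical example: two areas, three categories, three tasks.
root = TaskNode("root")
root.add_path(["reasoning", "math", "arithmetic word problems"]).ratings += [3.0, 4.0]
root.add_path(["reasoning", "logic", "syllogisms"]).ratings += [2.0]
root.add_path(["text generation", "summarization", "news summary"]).ratings += [5.0]

for area_name, area in root.children.items():
    print(area_name, area.score())
```

Rolling scores up level by level is one way to report results at any depth of the tree, for example per task for error analysis or per area for a headline comparison. A weighted aggregate would be an equally plausible design.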

Code & Models

Repositories

xsysigma/tencentllmeval
none · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Sparse Evolutionary Training