INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models

Yew Ken Chia; Pengfei Hong; Lidong Bing; Soujanya Poria

arXiv:2306.04757 · cs.CL · June 16, 2023 · 26 cites

TL;DR

INSTRUCTEVAL provides a comprehensive evaluation framework for instruction-tuned large language models, assessing their problem-solving, writing, and alignment capabilities to better understand their full potential and guide future improvements.

Contribution

This work introduces INSTRUCTEVAL, a holistic evaluation suite specifically designed for instruction-tuned large language models, addressing gaps in understanding their capabilities and limitations.

Findings

1. Instruction data quality is the most critical factor for model performance.
2. Open-source models excel in writing but need improvement in problem-solving and alignment.
3. Rigorous evaluation is essential for validating claims about model capabilities.

Abstract

Instruction-tuned large language models have revolutionized natural language processing and have shown great potential in applications such as conversational agents. These models, such as GPT-4, can not only master language but also solve complex tasks in areas like mathematics, coding, medicine, and law. Despite their impressive capabilities, there is still a lack of comprehensive understanding regarding their full potential, primarily due to the black-box nature of many models and the absence of holistic evaluation studies. To address these challenges, we present INSTRUCTEVAL, a more comprehensive evaluation suite designed specifically for instruction-tuned large language models. Unlike previous works, our evaluation involves a rigorous assessment of models based on problem-solving, writing ability, and alignment to human values. We take a holistic approach to analyze various factors…

Code & Models

Datasets

declare-lab/InstructEvalImpact
dataset · 20 dl

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Artificial Intelligence in Healthcare and Education

Methods: Multi-Head Attention · Attention Is All You Need · Dropout · Linear Layer · Label Smoothing · Position-Wise Feed-Forward Layer · Residual Connection · Layer Normalization · Adam · Byte Pair Encoding