INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models
Yew Ken Chia, Pengfei Hong, Lidong Bing, Soujanya Poria

TL;DR
INSTRUCTEVAL provides a comprehensive evaluation framework for instruction-tuned large language models, assessing their problem-solving, writing, and alignment capabilities to better understand their full potential and guide future improvements.
Contribution
This work introduces INSTRUCTEVAL, a holistic evaluation suite specifically designed for instruction-tuned large language models, addressing gaps in understanding their capabilities and limitations.
Findings
Instruction data quality is the most critical factor for model performance.
Open-source models excel in writing but need improvement in problem-solving and alignment.
Rigorous evaluation is essential for validating claims about model capabilities.
Abstract
Instruction-tuned large language models have revolutionized natural language processing and have shown great potential in applications such as conversational agents. These models, such as GPT-4, can not only master language but also solve complex tasks in areas like mathematics, coding, medicine, and law. Despite their impressive capabilities, there is still a lack of comprehensive understanding regarding their full potential, primarily due to the black-box nature of many models and the absence of holistic evaluation studies. To address these challenges, we present INSTRUCTEVAL, a more comprehensive evaluation suite designed specifically for instruction-tuned large language models. Unlike previous works, our evaluation involves a rigorous assessment of models based on problem-solving, writing ability, and alignment to human values. We take a holistic approach to analyze various factors…
Code & Models
- 🤗 declare-lab/flan-alpaca-xl (model) · 108 dl · ♡ 118
- 🤗 declare-lab/flan-alpaca-base (model) · 185 dl · ♡ 35
- 🤗 declare-lab/flan-alpaca-large (model) · 138 dl · ♡ 48
- 🤗 declare-lab/flan-alpaca-xxl (model) · 5 dl · ♡ 39
- 🤗 declare-lab/flan-gpt4all-xl (model) · 25 dl · ♡ 25
- 🤗 declare-lab/flan-alpaca-xl-lora (model) · ♡ 4
- 🤗 declare-lab/flan-sharegpt-xl (model) · 6 dl · ♡ 11
- 🤗 declare-lab/flan-alpaca-gpt4-xl (model) · 355 dl · ♡ 43
- 🤗 jncraton/flan-alpaca-base-ct2-int8 (model) · 2 dl
- 🤗 jncraton/flan-alpaca-xl-ct2-int8 (model) · 5 dl
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Artificial Intelligence in Healthcare and Education
Methods: Multi-Head Attention · Attention Is All You Need · Dropout · Linear Layer · Label Smoothing · Position-Wise Feed-Forward Layer · Residual Connection · Layer Normalization · Adam · Byte Pair Encoding
