PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization

Yidong Wang; Zhuohao Yu; Zhengran Zeng; Linyi Yang; Cunxiang Wang; Hao Chen; Chaoya Jiang; Rui Xie; Jindong Wang; Xing Xie; Wei Ye; Shikun Zhang; Yue Zhang

arXiv:2306.05087·cs.CL·May 27, 2024·29 cites

TL;DR

PandaLM is an LLM-based judge model for evaluating instruction-tuned LLMs. It weighs subjective factors such as clarity and adherence to instructions in addition to correctness, providing a cost-effective and privacy-preserving alternative to traditional evaluation methods.

Contribution

This paper introduces PandaLM, a judge LLM trained to evaluate other models based on subjective criteria, and demonstrates its effectiveness and reliability in LLM evaluation.

Findings

1. PandaLM-7B achieves 93.75% of GPT-3.5's evaluation ability, measured by F1-score on the authors' test dataset.

2. PandaLM's evaluations correlate well with human preferences.

3. Models whose hyperparameters were selected with PandaLM outperform counterparts trained with default hyperparameters.

Abstract

Instruction tuning large language models (LLMs) remains a challenging task, owing to the complexity of hyperparameter selection and the difficulty involved in evaluating the tuned models. To determine the optimal hyperparameters, an automatic, robust, and reliable evaluation benchmark is essential. However, establishing such a benchmark is not a trivial task due to the challenges associated with evaluation accuracy and privacy protection. In response to these challenges, we introduce a judge large language model, named PandaLM, which is trained to distinguish the superior model given several LLMs. PandaLM's focus extends beyond just the objective correctness of responses, which is the main focus of traditional evaluation datasets. It addresses vital subjective factors such as relative conciseness, clarity, adherence to instructions, comprehensiveness, and formality. To ensure the…
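As an illustration of how a pairwise judge could drive hyperparameter selection, the following is a minimal sketch and not the authors' implementation. The names `rank_configs` and `toy_judge` are hypothetical, the verdict encoding (1, 2, or 0 for a tie) is an assumption, and querying both response orderings is an added precaution against position bias rather than a step taken from the text above. `toy_judge` simply prefers the shorter response and stands in for a real judge model such as PandaLM.

```python
from itertools import combinations
from typing import Callable, Dict, List

# Hypothetical judge interface: given an instruction and two responses,
# return 1 if the first response is better, 2 if the second is, 0 for a tie.
Judge = Callable[[str, str, str], int]


def rank_configs(
    instructions: List[str],
    responses: Dict[str, List[str]],
    judge: Judge,
) -> List[str]:
    """Rank hyperparameter configs by pairwise wins awarded by the judge.

    `responses` maps a config name to that model's answers, aligned by index
    with `instructions`.
    """
    wins = {name: 0 for name in responses}
    for a, b in combinations(responses, 2):
        for i, instruction in enumerate(instructions):
            # Ask in both orders and count a win only when the two
            # verdicts agree (assumption: this reduces position bias).
            first = judge(instruction, responses[a][i], responses[b][i])
            second = judge(instruction, responses[b][i], responses[a][i])
            if first == 1 and second == 2:
                wins[a] += 1
            elif first == 2 and second == 1:
                wins[b] += 1
    # Highest win count first; ties keep their original order.
    return sorted(wins, key=wins.get, reverse=True)


def toy_judge(instruction: str, response_a: str, response_b: str) -> int:
    """Placeholder judge that prefers the shorter response."""
    if len(response_a) < len(response_b):
        return 1
    if len(response_b) < len(response_a):
        return 2
    return 0
```

In practice the placeholder would be replaced by a call to the judge model, and the top-ranked config would be the one kept for further training or deployment.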

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Multi-Head Attention · Attention Is All You Need · Dropout · Linear Layer · Attention Dropout · Cosine Annealing · Residual Connection