PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization
Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, Wei Ye, Shikun Zhang, Yue Zhang

TL;DR
PandaLM is a novel large language model-based evaluation benchmark designed to assess instruction-tuned LLMs by considering subjective factors like clarity and adherence, providing a cost-effective and privacy-preserving alternative to traditional evaluation methods.
Contribution
This paper introduces PandaLM, a judge LLM trained to evaluate other models based on subjective criteria, and demonstrates its effectiveness and reliability in LLM evaluation.
Findings
PandaLM-7B achieves 93.75% of GPT-3.5's evaluation ability.
PandaLM evaluation correlates well with human preferences.
Models tuned with PandaLM outperform those with default hyperparameters.
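One way to picture the tuning finding above is as a round-robin of pairwise judgments over candidate hyperparameter settings. The sketch below is an illustration under stated assumptions, not PandaLM's actual interface: the `Judge` signature, the verdict encoding (1, 2, 0), `rank_by_pairwise_wins`, `toy_judge`, and the candidate names are all hypothetical. The toy judge simply prefers the shorter response so that the example runs without a model.

```python
from itertools import combinations
from typing import Callable, Dict

# Hypothetical judge signature (an assumption, not PandaLM's API):
# (instruction, response_a, response_b) -> 1 if a is better, 2 if b is better, 0 for a tie.
Judge = Callable[[str, str, str], int]


def rank_by_pairwise_wins(
    responses: Dict[str, Dict[str, str]], judge: Judge
) -> Dict[str, int]:
    """Count pairwise wins per candidate.

    `responses` maps a candidate name to {instruction: response}.
    Every pair of candidates is compared on every instruction.
    """
    wins = {name: 0 for name in responses}
    for a, b in combinations(responses, 2):
        for instruction in responses[a]:
            verdict = judge(instruction, responses[a][instruction], responses[b][instruction])
            if verdict == 1:
                wins[a] += 1
            elif verdict == 2:
                wins[b] += 1
            # verdict == 0 (tie) gives no credit to either side
    return wins


def toy_judge(instruction: str, a: str, b: str) -> int:
    """Stand-in judge that prefers the shorter response; a real judge would be an LLM."""
    if len(a) == len(b):
        return 0
    return 1 if len(a) < len(b) else 2


def demo() -> Dict[str, int]:
    # Hypothetical candidates, e.g. two learning-rate settings of the same base model.
    candidates = {
        "lr_1e-5": {"Say hi.": "Hi.", "Name a color.": "Blue.", "Agree?": "Ok."},
        "lr_2e-5": {"Say hi.": "Hello there, friend!", "Name a color.": "Red is a color.", "Agree?": "No."},
    }
    return rank_by_pairwise_wins(candidates, toy_judge)


if __name__ == "__main__":
    print(demo())
```

The setting with the most wins would be the one kept. Ties are left uncounted here; a different scoring rule (for example, half credit per tie) is an equally plausible choice.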
Abstract
Instruction tuning large language models (LLMs) remains a challenging task, owing to the complexity of hyperparameter selection and the difficulty involved in evaluating the tuned models. To determine the optimal hyperparameters, an automatic, robust, and reliable evaluation benchmark is essential. However, establishing such a benchmark is not a trivial task due to the challenges associated with evaluation accuracy and privacy protection. In response to these challenges, we introduce a judge large language model, named PandaLM, which is trained to distinguish the superior model given several LLMs. PandaLM's focus extends beyond just the objective correctness of responses, which is the main focus of traditional evaluation datasets. It addresses vital subjective factors such as relative conciseness, clarity, adherence to instructions, comprehensiveness, and formality. To ensure the…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Multi-Head Attention · Attention Is All You Need · Dropout · Linear Layer · Attention Dropout · Cosine Annealing · Residual Connection
