Beyond Static Datasets: A Deep Interaction Approach to LLM Evaluation

Jiatong Li; Rui Li; Qi Liu

arXiv:2309.04369·cs.CL·September 11, 2023·2 cites

Open Access

TL;DR

This paper introduces a novel deep interaction-based framework for evaluating large language models in dynamic, real-world scenarios, overcoming limitations of static datasets and costly human assessments.

Contribution

It proposes a general evaluation framework that assesses LLMs through their interactions with other models across various real-world tasks, enabling scalable and dynamic evaluation.

Findings

01 Effective in evaluating LLMs in real-world domains

02 Applicable to multiple tasks like translation and code generation

03 Demonstrated through extensive experiments

Abstract

Large Language Models (LLMs) have made progress in various real-world tasks, which creates a need for evaluating them. Existing LLM evaluation methods are mainly supervised signal-based: they depend on static datasets and cannot evaluate the ability of LLMs in dynamic real-world scenarios where deep interaction is common. Other LLM evaluation methods are human-based; these are costly and time-consuming and cannot support large-scale evaluation of LLMs. To address these issues, we propose a novel Deep Interaction-based LLM-evaluation framework. In our proposed framework, LLMs' performance in real-world domains can be evaluated from their deep interaction with other LLMs in elaborately designed evaluation tasks. Furthermore, our proposed framework is a general evaluation method that can be applied to a host of real-world tasks such as machine translation and…
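The idea of evaluating models from their interaction with each other, rather than against a fixed answer key, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the round-based turn order, the `Agent` type, the `run_interaction` and `score_transcript` helpers, and the length-based `judge` are all hypothetical, and simple stub functions stand in for real LLM calls.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type: an "agent" maps the conversation history to its next message.
Agent = Callable[[List[str]], str]


def run_interaction(
    agents: Dict[str, Agent], task_prompt: str, rounds: int
) -> List[Tuple[str, str]]:
    """Let the agents take turns responding to a shared, growing history."""
    history = [task_prompt]
    transcript: List[Tuple[str, str]] = []
    for _ in range(rounds):
        for name, agent in agents.items():
            message = agent(history)
            history.append(message)
            transcript.append((name, message))
    return transcript


def score_transcript(
    transcript: List[Tuple[str, str]], judge: Callable[[str], float]
) -> Dict[str, float]:
    """Average a per-message score for each agent (the judge is a placeholder)."""
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for name, message in transcript:
        totals[name] = totals.get(name, 0.0) + judge(message)
        counts[name] = counts.get(name, 0) + 1
    return {name: totals[name] / counts[name] for name in totals}


# Stub agents stand in for real LLM calls (assumption for illustration).
def echo_agent(history: List[str]) -> str:
    return "echo: " + history[-1]


def short_agent(history: List[str]) -> str:
    return "ok"


transcript = run_interaction(
    {"model_a": echo_agent, "model_b": short_agent},
    "Translate 'hello' to French.",
    rounds=2,
)
# Toy judge: message length. A real judge would be task-specific.
scores = score_transcript(transcript, judge=lambda m: float(len(m)))
print(scores)
```

The point of the sketch is only the shape of the loop: each model's output becomes part of the context the other models respond to, so the score depends on the interaction rather than on a static reference answer.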

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification