DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph

Zhehao Zhang; Jiaao Chen; Diyi Yang

arXiv:2406.17271 · cs.CL · June 26, 2024

TL;DR

DARG is a dynamic evaluation framework that adaptively generates reasoning test data of controlled complexity for LLMs. It shows that model performance drops and biases increase as complexity rises, offering a more robust assessment than static benchmarks.

Contribution

This work presents a novel method to dynamically extend benchmarks by perturbing reasoning graphs, enabling adaptive and controlled evaluation of LLMs across multiple domains.

Findings

01

LLMs' performance decreases with increased data complexity

02

Certain LLMs show significant performance drops at higher complexity levels

03

Higher complexity evaluations reveal increased biases in LLMs

Abstract

The current paradigm of evaluating Large Language Models (LLMs) through static benchmarks comes with significant limitations, such as vulnerability to data contamination and a lack of adaptability to the evolving capabilities of LLMs. Therefore, evaluation methods that can adapt and generate evaluation data with controlled complexity are urgently needed. In this work, we introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity. Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data. Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks. We further use a code-augmented LLM to ensure the…
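The extract-then-perturb idea in the abstract can be illustrated with a toy arithmetic word problem. The sketch below is not the authors' implementation: the `Node` class, the `evaluate` and `add_step` helpers, and the single "add one more step" perturbation are all hypothetical names chosen for illustration, and the LLM-based steps (extracting the graph from text and turning the perturbed graph back into a question) are omitted. It only shows how a reasoning graph can be made more complex while its ground-truth answer is recomputed by code.

```python
import operator
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Supported operations for this toy example (an assumption, not from the paper).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


@dataclass(frozen=True)
class Node:
    """Hypothetical reasoning-graph node: a leaf value or a binary operation."""
    name: str
    value: Optional[float] = None      # set for leaf nodes only
    op: Optional[str] = None           # set for operation nodes only
    inputs: Tuple[str, ...] = ()       # names of the two input nodes


def evaluate(graph: Dict[str, Node], name: str) -> float:
    """Recursively compute the value of the node called `name`."""
    node = graph[name]
    if node.op is None:
        return node.value
    left, right = (evaluate(graph, child) for child in node.inputs)
    return OPS[node.op](left, right)


def add_step(graph: Dict[str, Node], root: str, leaf: str,
             leaf_value: float, op: str, new_root: str) -> Dict[str, Node]:
    """Return a copy of `graph` with one extra operation applied to `root`."""
    perturbed = dict(graph)            # the original graph is left untouched
    perturbed[leaf] = Node(leaf, value=leaf_value)
    perturbed[new_root] = Node(new_root, op=op, inputs=(root, leaf))
    return perturbed


# Original problem: (3 apples + 4 pears) * 2 boxes.
graph = {
    "apples": Node("apples", value=3),
    "pears": Node("pears", value=4),
    "fruit": Node("fruit", op="+", inputs=("apples", "pears")),
    "boxes": Node("boxes", value=2),
    "total": Node("total", op="*", inputs=("fruit", "boxes")),
}
original_answer = evaluate(graph, "total")

# Perturbed problem: one more reasoning step (5 pieces are eaten).
harder = add_step(graph, "total", "eaten", 5, "-", "remaining")
harder_answer = evaluate(harder, "remaining")
```

Because the label is obtained by evaluating the graph rather than by asking a model, the perturbed sample keeps a verifiable answer however many steps are added.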

Code & Models

Repositories

salt-nlp/darg
Official

Videos

DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques