RoCar: A Relationship Network-based Evaluation Method for Large Language Models

Ming Wang; Wenfang Wu; Chongyun Gao; Daling Wang; Shi Feng; Yifei Zhang

arXiv:2307.15997·cs.CL·November 12, 2024

Open Access · 1 Repo

TL;DR

RoCar is an evaluation method for large language models that randomly constructs task graphs to assess reasoning and memory abilities. Because the tasks are generated at random, the models under test are unlikely to have seen them during training, which keeps the evaluation fair.

Contribution

The paper introduces RoCar, a new evaluation approach using random task graphs to fairly and effectively assess LLM reasoning and memory capabilities.

Findings

01

Ensures fair evaluation by preventing prior exposure to tasks.

02

Effectively assesses reasoning and memory abilities of LLMs.

03

Uses random task graph construction for diverse evaluation scenarios.

Abstract

Large language models (LLMs) have received increasing attention. However, because their capabilities are complex, how to evaluate them rationally remains an open problem. We propose the RoCar method, which uses a set of defined basic schemas to randomly construct a task graph and then generates natural language evaluation tasks from that graph to evaluate the reasoning and memory abilities of LLMs, respectively. Because the task construction process is highly random, it is possible to ensure that none of the LLMs under test has directly learned the evaluation tasks, which guarantees the fairness of the evaluation method.
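The abstract's pipeline (basic schemas, a randomly constructed task graph, then natural language tasks) can be illustrated with a small sketch. This is not the authors' implementation: the schema list, the function names (`build_task_graph`, `to_statements`, `memory_question`, `two_hop_chains`), and the question wording are all hypothetical and chosen only to make the idea concrete.

```python
import random

# Hypothetical "basic schemas": (relation name, sentence template).
# The paper's actual schemas are not given in this page; these are assumptions.
SCHEMAS = [
    ("parent", "{a} is the parent of {b}."),
    ("friend", "{a} is a friend of {b}."),
    ("colleague", "{a} is a colleague of {b}."),
]


def build_task_graph(num_nodes, num_edges, rng):
    """Sample a random directed graph; each edge carries one schema."""
    nodes = [f"Person{i}" for i in range(num_nodes)]
    # All ordered pairs of distinct nodes are candidate edges.
    pairs = [(a, b) for a in nodes for b in nodes if a != b]
    chosen = rng.sample(pairs, num_edges)  # needs num_edges <= len(pairs)
    return [(a, rng.choice(SCHEMAS), b) for a, b in chosen]


def to_statements(edges):
    """Render every edge as a natural language sentence."""
    return [template.format(a=a, b=b) for a, (_, template), b in edges]


def memory_question(edges, rng):
    """Ask about one explicitly stated edge (a recall-style task)."""
    a, (name, _), b = rng.choice(edges)
    return f"According to the statements, how is {a} related to {b}?", name


def two_hop_chains(edges):
    """List chains a -> b -> c, raw material for a reasoning-style task."""
    return [
        (a, r1[0], b, r2[0], c)
        for a, r1, b in edges
        for b2, r2, c in edges
        if b == b2 and c != a
    ]


if __name__ == "__main__":
    rng = random.Random()  # unseeded, so each run yields a fresh task
    edges = build_task_graph(num_nodes=5, num_edges=6, rng=rng)
    print("\n".join(to_statements(edges)))
    print(memory_question(edges, rng))
    print(two_hop_chains(edges))
```

With 5 nodes, 6 edges, and 3 schemas there are already tens of millions of distinct graphs, which gives a rough sense of why a model is unlikely to have seen any particular generated task verbatim. Passing a seeded `random.Random(seed)` instead makes a given task reproducible across models.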

Code & Models

Repositories

neu-datamining/rocar · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Data Quality and Management

Methods: None