RoCar: A Relationship Network-based Evaluation Method for Large Language Models
Ming Wang, Wenfang Wu, Chongyun Gao, Daling Wang, Shi Feng, Yifei Zhang

TL;DR
RoCar is a novel evaluation method for large language models that constructs random task graphs to assess reasoning and memory abilities, ensuring fairness by preventing models from having seen the tasks before.
Contribution
The paper introduces RoCar, a new evaluation approach using random task graphs to fairly and effectively assess LLM reasoning and memory capabilities.
Findings
Ensures fair evaluation by preventing prior exposure to tasks.
Effectively assesses reasoning and memory abilities of LLMs.
Uses random task graph construction for diverse evaluation scenarios.
Abstract
Large language models (LLMs) have received increasing attention. However, because their capabilities are complex, how to evaluate them rationally remains an open problem. We propose the RoCar method, which uses predefined basic schemas to randomly construct a task graph and then generates natural language evaluation tasks from that graph to evaluate the reasoning and memory abilities of LLMs separately. Because the task construction process is highly random, it is possible to ensure that none of the LLMs under test has directly learned the evaluation tasks, which guarantees the fairness of the evaluation method.
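The pipeline described in the abstract (basic schemas, a randomly constructed task graph, then natural language tasks for memory and reasoning) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the three relation schemas, the anonymized `PersonN` node names, the function names, and the choice of a one-hop lookup as the memory task and a two-hop chain as the reasoning task.

```python
import random

# Hypothetical basic schemas: relation name -> sentence template.
SCHEMAS = {
    "parent": "{a} is the parent of {b}.",
    "teacher": "{a} is the teacher of {b}.",
    "friend": "{a} is the friend of {b}.",
}

def build_task_graph(num_nodes, num_edges, rng):
    """Randomly draw directed, labeled edges; at most one edge per ordered pair."""
    names = [f"Person{i}" for i in range(num_nodes)]
    relations = sorted(SCHEMAS)
    target = min(num_edges, num_nodes * (num_nodes - 1))
    edges = {}  # (source, target) -> relation name
    while len(edges) < target:
        a, b = rng.sample(names, 2)  # two distinct nodes
        edges.setdefault((a, b), rng.choice(relations))
    return names, edges

def to_statements(edges):
    """Render every edge as a natural language fact."""
    return [SCHEMAS[rel].format(a=a, b=b) for (a, b), rel in edges.items()]

def memory_task(edges, rng):
    """One-hop lookup: the answer is stated verbatim in the facts."""
    (a, b), rel = rng.choice(sorted(edges.items()))
    return f"According to the statements, what is {a} to {b}?", rel

def reasoning_task(edges, rng):
    """Two-hop chain: the answer requires combining two facts."""
    hops = [(a, b, c) for (a, b) in edges for (b2, c) in edges
            if b == b2 and c != a]
    if not hops:
        return None  # this random graph happens to contain no two-hop path
    a, b, c = rng.choice(sorted(hops))
    r1, r2 = edges[(a, b)], edges[(b, c)]
    question = (f"{a} is the {r1} of someone who is the {r2} of {c}. "
                f"Who is that someone?")
    # Every intermediate node matching both relations is a valid answer.
    answers = {m for (x, m), r in edges.items()
               if x == a and r == r1 and edges.get((m, c)) == r2}
    return question, answers

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    names, edges = build_task_graph(num_nodes=6, num_edges=8, rng=rng)
    print("\n".join(to_statements(edges)))
    print(memory_task(edges, rng))
    print(reasoning_task(edges, rng))
```

Because the node names, edges, and relation labels are all drawn at random, two runs with different seeds are unlikely to produce the same set of facts, which is the intuition behind the fairness argument. In an actual evaluation the seed would not be fixed.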
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Data Quality and Management
Methods: None
