Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks

Rushang Karia; Daniel Bramblett; Daksh Dobhal; Siddharth Srivastava

arXiv:2410.08437·cs.AI·April 15, 2025

Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks

Rushang Karia, Daniel Bramblett, Daksh Dobhal, Siddharth Srivastava

PDF

Open Access 1 Video 3 Reviews

TL;DR

AutoEval introduces an automated, scalable benchmark for assessing LLMs on formal tasks like truth maintenance and reasoning, reducing reliance on human labels and static datasets.

Contribution

This paper presents AutoEval, the first benchmark enabling objective, automated evaluation of LLMs across increasing difficulty levels without human annotation.

Findings

01

AutoEval correlates well with other benchmarks.

02

It effectively mitigates overfitting to static datasets.

03

AutoEval can evaluate models of varying sophistication.

Abstract

This paper presents AutoEval, a novel benchmark for scaling Large Language Model (LLM) assessment in formal tasks with clear notions of correctness, such as truth maintenance in translation and logical reasoning. AutoEval is the first benchmarking paradigm that offers several key advantages necessary for scaling objective evaluation of LLMs without human labeling: (a) ability to evaluate LLMs of increasing sophistication by auto-generating tasks at different levels of difficulty; (b) auto-generation of ground truth that eliminates dependence on expensive and time-consuming human annotation; (c) the use of automatically generated, randomized datasets that mitigate the ability of successive LLMs to overfit to static datasets used in many contemporary benchmarks. Empirical analysis shows that an LLM's performance on AutoEval is highly indicative of its performance on a diverse array of…

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01Rating 8Confidence 4

Strengths

1. LLMs can be now tested on dynamic data generated by this framework for Formal Syntax 2. Their method has better indication of accuracy than other metrics for the LLM performance on various FOL. 3. Very good experimentations , dataset size and presentation

Weaknesses

1. Explanation on how the generated dataset is having diverse data points or quality is missing. Update: Rebuttal answered various points about this and mentioned they have added more information in appendix.

Reviewer 02Rating 6Confidence 3

Strengths

1. The article introduces the ∀UTO∃∨∧L benchmark as an innovative evaluation framework that can automatically assess the ability of large language models (LLMs) to maintain truthfulness in formal tasks. Its practicality lies in dynamically generating datasets, reducing the need for manual annotation, and providing an evaluation method without human intervention. 2. The ∀UTO∃∨∧L framework is designed to be flexible, allowing it to be extended to any formal grammar category that uses syntax and ac

Weaknesses

1. ∀UTO∃∨∧L relies on existing formal verifiers and parsing libraries to ensure evaluation accuracy. This means its reliability depends on the performance and accuracy of these tools, and any errors in them could affect the evaluation results of ∀UTO∃∨∧L. 2. While ∀UTO∃∨∧L reduces the need for manual annotation, generating dynamic datasets and performing formal verification may require significant computational resources, especially for large-scale datasets and complex logical structures.

Reviewer 03Rating 5Confidence 5

Strengths

- The use of CFGs to dynamically generate out-of-distribution datasets is a significant strength. - The use of formal verifiers to check the semantic equivalence of LLM outputs is a good method that avoids brittle syntactic checks. This allows for a more robust evaluation of truth maintenance across different formal syntaxes.

Weaknesses

- While the paper focuses on formal tasks like truth maintenance and logical reasoning, it remains unclear how well the proposed benchmark generalizes to less structured or more complex natural language tasks. The reliance on formal verifiers may limit its applicability to broader LLM use cases, such as creative writing or conversational AI, where formal correctness is less critical. - Informalization, the process of converting formal syntax into natural language, is inherently ambiguous. The pa

Videos

Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks· slideslive

Taxonomy

TopicsArtificial Intelligence in Law