Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks
Rushang Karia, Daniel Bramblett, Daksh Dobhal, Siddharth Srivastava

TL;DR
AutoEval introduces an automated, scalable benchmark for assessing LLMs on formal tasks like truth maintenance and reasoning, reducing reliance on human labels and static datasets.
Contribution
This paper presents AutoEval, the first benchmark enabling objective, automated evaluation of LLMs across increasing difficulty levels without human annotation.
Findings
AutoEval correlates well with other benchmarks.
It effectively mitigates overfitting to static datasets.
AutoEval can evaluate models of varying sophistication.
Abstract
This paper presents AutoEval, a novel benchmark for scaling Large Language Model (LLM) assessment in formal tasks with clear notions of correctness, such as truth maintenance in translation and logical reasoning. AutoEval is the first benchmarking paradigm that offers several key advantages necessary for scaling objective evaluation of LLMs without human labeling: (a) ability to evaluate LLMs of increasing sophistication by auto-generating tasks at different levels of difficulty; (b) auto-generation of ground truth that eliminates dependence on expensive and time-consuming human annotation; (c) the use of automatically generated, randomized datasets that mitigate the ability of successive LLMs to overfit to static datasets used in many contemporary benchmarks. Empirical analysis shows that an LLM's performance on AutoEval is highly indicative of its performance on a diverse array of…
Peer Reviews
Decision·ICLR 2025 Poster
1. LLMs can be now tested on dynamic data generated by this framework for Formal Syntax 2. Their method has better indication of accuracy than other metrics for the LLM performance on various FOL. 3. Very good experimentations , dataset size and presentation
1. Explanation on how the generated dataset is having diverse data points or quality is missing. Update: Rebuttal answered various points about this and mentioned they have added more information in appendix.
1. The article introduces the ∀UTO∃∨∧L benchmark as an innovative evaluation framework that can automatically assess the ability of large language models (LLMs) to maintain truthfulness in formal tasks. Its practicality lies in dynamically generating datasets, reducing the need for manual annotation, and providing an evaluation method without human intervention. 2. The ∀UTO∃∨∧L framework is designed to be flexible, allowing it to be extended to any formal grammar category that uses syntax and ac
1. ∀UTO∃∨∧L relies on existing formal verifiers and parsing libraries to ensure evaluation accuracy. This means its reliability depends on the performance and accuracy of these tools, and any errors in them could affect the evaluation results of ∀UTO∃∨∧L. 2. While ∀UTO∃∨∧L reduces the need for manual annotation, generating dynamic datasets and performing formal verification may require significant computational resources, especially for large-scale datasets and complex logical structures.
- The use of CFGs to dynamically generate out-of-distribution datasets is a significant strength. - The use of formal verifiers to check the semantic equivalence of LLM outputs is a good method that avoids brittle syntactic checks. This allows for a more robust evaluation of truth maintenance across different formal syntaxes.
- While the paper focuses on formal tasks like truth maintenance and logical reasoning, it remains unclear how well the proposed benchmark generalizes to less structured or more complex natural language tasks. The reliance on formal verifiers may limit its applicability to broader LLM use cases, such as creative writing or conversational AI, where formal correctness is less critical. - Informalization, the process of converting formal syntax into natural language, is inherently ambiguous. The pa
Videos
Taxonomy
TopicsArtificial Intelligence in Law
