Psychometric-Based Evaluation for Theorem Proving with Large Language Models
Jianyu Zhang, Yongwang Zhao, Long Zhang, Jilin Hu, Xiaokun Luan,, Zhiwei Xu, Feng Yang

TL;DR
This paper introduces a psychometric-based evaluation framework for large language models in theorem proving, which grades theorem difficulty and adaptively selects test items to better reflect true performance and reduce evaluation costs.
Contribution
The study presents a novel psychometric-based evaluation method with dataset annotation and adaptive testing, improving accuracy and efficiency over traditional proof pass rate metrics.
Findings
Difficulty grading aligns with LLM perceptions of theorem complexity.
Adaptive evaluation reduces testing to 23% of the dataset while highlighting performance disparities.
The method effectively differentiates LLM capabilities in formal theorem proving.
Abstract
Large language models (LLMs) for formal theorem proving have become a prominent research focus. At present, the proving ability of these LLMs is mainly evaluated through proof pass rates on datasets such as miniF2F. However, this evaluation method overlooks the varying importance of theorems. As a result, it fails to highlight the real performance disparities between LLMs and leads to high evaluation costs. This study proposes a psychometric-based evaluation method for theorem proving with LLMs, comprising two main components: Dataset Annotation and Adaptive Evaluation. First, we propose a metric calculation method to annotate the dataset with difficulty and discrimination metrics. Specifically, we annotate each theorem in the miniF2F dataset and grade them into varying difficulty levels according to the performance of LLMs, resulting in an enhanced dataset: miniF2F-Graded. Experimental…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsIntelligent Tutoring Systems and Adaptive Learning · Topic Modeling
