Psychometric-Based Evaluation for Theorem Proving with Large Language   Models

Jianyu Zhang; Yongwang Zhao; Long Zhang; Jilin Hu; Xiaokun Luan,; Zhiwei Xu; Feng Yang

arXiv:2502.00855·cs.AI·February 4, 2025

Psychometric-Based Evaluation for Theorem Proving with Large Language Models

Jianyu Zhang, Yongwang Zhao, Long Zhang, Jilin Hu, Xiaokun Luan,, Zhiwei Xu, Feng Yang

PDF

Open Access

TL;DR

This paper introduces a psychometric-based evaluation framework for large language models in theorem proving, which grades theorem difficulty and adaptively selects test items to better reflect true performance and reduce evaluation costs.

Contribution

The study presents a novel psychometric-based evaluation method with dataset annotation and adaptive testing, improving accuracy and efficiency over traditional proof pass rate metrics.

Findings

01

Difficulty grading aligns with LLM perceptions of theorem complexity.

02

Adaptive evaluation reduces testing to 23% of the dataset while highlighting performance disparities.

03

The method effectively differentiates LLM capabilities in formal theorem proving.

Abstract

Large language models (LLMs) for formal theorem proving have become a prominent research focus. At present, the proving ability of these LLMs is mainly evaluated through proof pass rates on datasets such as miniF2F. However, this evaluation method overlooks the varying importance of theorems. As a result, it fails to highlight the real performance disparities between LLMs and leads to high evaluation costs. This study proposes a psychometric-based evaluation method for theorem proving with LLMs, comprising two main components: Dataset Annotation and Adaptive Evaluation. First, we propose a metric calculation method to annotate the dataset with difficulty and discrimination metrics. Specifically, we annotate each theorem in the miniF2F dataset and grade them into varying difficulty levels according to the performance of LLMs, resulting in an enhanced dataset: miniF2F-Graded. Experimental…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsIntelligent Tutoring Systems and Adaptive Learning · Topic Modeling