TheoremQA: A Theorem-driven Question Answering dataset

Wenhu Chen; Ming Yin; Max Ku; Pan Lu; Yixin Wan; Xueguang Ma; Jianyu Xu; Xinyi Wang; Tony Xia

arXiv:2305.12524 · cs.CL · December 7, 2023 · 5 cites



TL;DR

TheoremQA is a new dataset designed to evaluate AI models' ability to apply theorems in solving complex science problems, revealing significant gaps in current open-source models compared to GPT-4.

Contribution

The paper introduces what the authors describe as the first theorem-driven question-answering dataset, with 800 questions covering 350 theorems from Math, Physics, EE&CS, and Finance, and evaluates 16 large language and code models on this benchmark.

Findings

1. GPT-4 achieves 51% accuracy with Program-of-Thoughts prompting.
2. Open-source models score below 15%, barely above random chance.
3. TheoremQA provides a comprehensive benchmark for challenging science problem-solving by AI.

Abstract

The recent LLMs like GPT-4 and PaLM-2 have made tremendous progress in solving fundamental math problems like GSM8K by achieving over 90% accuracy. However, their capabilities to solve more challenging math problems which require domain-specific knowledge (i.e. theorem) have yet to be investigated. In this paper, we introduce TheoremQA, the first theorem-driven question-answering dataset designed to evaluate AI models' capabilities to apply theorems to solve challenging science problems. TheoremQA is curated by domain experts containing 800 high-quality questions covering 350 theorems (e.g. Taylor's theorem, Lagrange's theorem, Huffman coding, Quantum Theorem, Elasticity Theorem, etc) from Math, Physics, EE&CS, and Finance. We evaluate a wide spectrum of 16 large language and code models with different prompting strategies like Chain-of-Thoughts and Program-of-Thoughts. We found that…
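As a rough illustration of how Program-of-Thoughts answers can be scored, the sketch below executes a model-generated program, reads its result, and compares it with a reference answer. It is a minimal sketch under stated assumptions, not the authors' evaluation code: the helper names (`run_program`, `is_correct`, `accuracy`), the convention that the program stores its result in a variable called `ans`, the 1% relative tolerance, and the three toy programs are all hypothetical and chosen only for illustration.

```python
import math


def run_program(program: str):
    """Execute a generated program and return the value bound to `ans`.

    Returns None if the program raises or never defines `ans`.
    Assumption: results are stored in a variable named `ans`.
    """
    namespace = {}
    try:
        # Unsafe for untrusted code; a real harness would sandbox this call.
        exec(program, namespace)
    except Exception:
        return None
    return namespace.get("ans")


def is_correct(predicted, gold, rel_tol=1e-2):
    """Compare a prediction with the reference answer.

    Numeric answers match within a relative tolerance (assumed 1%);
    everything else must be exactly equal.
    """
    if predicted is None:
        return False
    # bool is a subclass of int, so handle True/False answers first.
    if isinstance(gold, bool) or isinstance(predicted, bool):
        return predicted == gold
    if isinstance(gold, (int, float)) and isinstance(predicted, (int, float)):
        return math.isclose(predicted, gold, rel_tol=rel_tol)
    return predicted == gold


def accuracy(programs, golds):
    """Fraction of programs whose executed answer matches the reference."""
    correct = sum(is_correct(run_program(p), g) for p, g in zip(programs, golds))
    return correct / len(golds)


# Hypothetical stand-ins for model outputs; no model is called here.
programs = [
    "import math\nans = math.comb(5, 2)",           # exact integer answer
    "ans = 1 / 0",                                  # crashes, counted as wrong
    "ans = sum(1 / n**2 for n in range(1, 1000))",  # approximates pi^2 / 6
]
golds = [10, 3.0, math.pi ** 2 / 6]

print(f"accuracy = {accuracy(programs, golds):.3f}")  # 2 of 3 correct
```

Counting a crashed or answerless program as wrong keeps the metric comparable with Chain-of-Thoughts scoring, where an unparseable answer is also a miss.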


Code & Models

Repositories

wenhuchen/theoremqa (PyTorch, official)


Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research

Methods: Attention Is All You Need · Absolute Position Encodings · Softmax · Layer Normalization · Byte Pair Encoding · Dropout · Linear Layer · Label Smoothing · Multi-Head Attention · Adam