TheoremQA: A Theorem-driven Question Answering dataset
Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, Tony Xia

TL;DR
TheoremQA is a new dataset designed to evaluate AI models' ability to apply theorems in solving complex science problems, revealing significant gaps in current open-source models compared to GPT-4.
Contribution
It introduces the first theorem-driven question-answering dataset with 800 questions covering 350 theorems across multiple domains, and evaluates various models' performance on this challenging benchmark.
Findings
GPT-4 achieves 51% accuracy with Program-of-Thoughts prompting.
Open-source models score below 15%, barely above random chance.
TheoremQA provides a comprehensive benchmark for challenging science problem-solving by AI.
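Program-of-Thoughts prompting asks the model to write a short program whose execution yields the answer, rather than working out the arithmetic in prose. The following is a minimal sketch of that idea, not the paper's evaluation harness: the "generated" program is hand-written for illustration, and the names run_program and is_correct, the `ans` variable convention, and the 1% tolerance are all assumptions.

```python
import math

# Hypothetical model output. In Program-of-Thoughts prompting the model
# emits code like this; the string below is hand-written for illustration
# and is not a real model response or a real TheoremQA item.
GENERATED_PROGRAM = """
import math

def solve():
    # Second-order Taylor polynomial of e^x around 0, evaluated at x = 0.1
    x = 0.1
    return sum(x**k / math.factorial(k) for k in range(3))

ans = solve()
"""


def run_program(program: str):
    """Execute generated code in a fresh namespace and return its `ans` variable."""
    namespace = {}
    # exec is unsafe for untrusted code; a real harness would sandbox this.
    exec(program, namespace)
    return namespace["ans"]


def is_correct(prediction: float, reference: float, rel_tol: float = 1e-2) -> bool:
    """Compare numeric answers with a relative tolerance (the tolerance is an assumption)."""
    return math.isclose(prediction, reference, rel_tol=rel_tol)


prediction = run_program(GENERATED_PROGRAM)
print(prediction, is_correct(prediction, 1.105))
```

Delegating the computation to an interpreter separates two failure modes: choosing the wrong theorem versus making an arithmetic slip while applying the right one.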
Abstract
Recent LLMs such as GPT-4 and PaLM-2 have made tremendous progress on fundamental math problems like GSM8K, achieving over 90% accuracy. However, their ability to solve more challenging math problems that require domain-specific knowledge (i.e., theorems) has yet to be investigated. In this paper, we introduce TheoremQA, the first theorem-driven question-answering dataset designed to evaluate AI models' ability to apply theorems to solve challenging science problems. TheoremQA is curated by domain experts and contains 800 high-quality questions covering 350 theorems (e.g., Taylor's theorem, Lagrange's theorem, Huffman coding, Quantum Theorem, Elasticity Theorem) from Math, Physics, EE&CS, and Finance. We evaluate a wide spectrum of 16 large language and code models with different prompting strategies such as Chain-of-Thoughts and Program-of-Thoughts. We found that…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research
Methods: Attention Is All You Need · Absolute Position Encodings · Softmax · Layer Normalization · Byte Pair Encoding · Dropout · Linear Layer · Label Smoothing · Multi-Head Attention · Adam
