STEM-PoM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing
Jiaru Zou, Qing Wang, Pratyush Thakur, Nickvash Kani

TL;DR
This paper introduces STEM-PoM, a benchmark dataset built from real-world scientific documents for evaluating and improving how well large language models understand math symbols in context. Experiments on the benchmark reveal significant performance gaps.
Contribution
STEM-PoM is the first comprehensive dataset for evaluating LLMs' reasoning with math symbols in scientific texts, aiding future model improvements.
Findings
State-of-the-art LLMs achieve 20-60% accuracy on symbol classification.
With fine-tuning, accuracy reaches 50-60%.
A significant gap remains in LLMs' mathematical reasoning abilities.
Abstract
Advances in large language models (LLMs) have spurred research into enhancing their reasoning capabilities, particularly in math-rich STEM (Science, Technology, Engineering, and Mathematics) documents. While LLMs can generate equations or solve math-related queries, their ability to fully understand and interpret abstract mathematical symbols in long, math-rich documents remains limited. In this paper, we introduce STEM-PoM, a comprehensive benchmark dataset designed to evaluate LLMs' reasoning abilities on math symbols within contextual scientific text. The dataset, sourced from real-world ArXiv documents, contains over 2K math symbols classified as main attributes of variables, constants, operators, and unit descriptors, with additional sub-attributes including scalar/vector/matrix for variables and local/global/discipline-specific labels for both constants and operators. Our…
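The label scheme described in the abstract can be sketched as a small data structure. This is a minimal illustration, not the authors' code or the dataset's actual file format: the names `MainAttribute`, `ALLOWED_SUB_ATTRIBUTES`, `SymbolLabel`, and `accuracy` are hypothetical. It also assumes that unit descriptors carry no sub-attribute, since the abstract lists none, and that a prediction counts as correct only when both levels match.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class MainAttribute(Enum):
    """The four main attributes named in the abstract."""
    VARIABLE = "variable"
    CONSTANT = "constant"
    OPERATOR = "operator"
    UNIT_DESCRIPTOR = "unit descriptor"


# Sub-attributes permitted under each main attribute, as listed in the abstract.
# Assumption: unit descriptors have no sub-attribute, because none is mentioned.
ALLOWED_SUB_ATTRIBUTES = {
    MainAttribute.VARIABLE: {"scalar", "vector", "matrix"},
    MainAttribute.CONSTANT: {"local", "global", "discipline-specific"},
    MainAttribute.OPERATOR: {"local", "global", "discipline-specific"},
    MainAttribute.UNIT_DESCRIPTOR: set(),
}


@dataclass(frozen=True)
class SymbolLabel:
    """A two-level label for one math symbol (hypothetical structure)."""
    main: MainAttribute
    sub: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject sub-attributes that do not belong to the chosen main attribute.
        if self.sub is not None and self.sub not in ALLOWED_SUB_ATTRIBUTES[self.main]:
            raise ValueError(
                f"{self.sub!r} is not a valid sub-attribute of {self.main.value!r}"
            )


def accuracy(predicted: Sequence[SymbolLabel], gold: Sequence[SymbolLabel]) -> float:
    """Fraction of symbols whose predicted label equals the gold label exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)
```

Scoring only the main attribute, rather than the full two-level label, would be an equally plausible choice; the abstract as excerpted here does not say which the benchmark uses.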
Taxonomy
Topics: Natural Language Processing Techniques · Mathematics, Computing, and Information Processing · Topic Modeling
