Mathematical Entities: Corpora and Benchmarks
Jacob Collard, Valeria de Paiva, Eswaran Subrahmanian

TL;DR
This paper introduces annotated corpora and benchmarks for mathematical language, evaluating NLP models' ability to process mathematical texts and highlighting the need for specialized adaptations.
Contribution
It provides large, annotated mathematical corpora, benchmarks for NLP tasks in mathematics, and a learning assistant, addressing the scarcity of resources in this domain.
Findings
Terminology extraction in mathematics is challenging.
Standard NLP models struggle with mathematical definitions.
Additional domain-specific adaptation is required for effective NLP in mathematics.
Abstract
Mathematics is a highly specialized domain with its own unique set of challenges. Despite this, there has been relatively little research on natural language processing for mathematical texts, and there are few mathematical language resources aimed at NLP. In this paper, we aim to provide annotated corpora that can be used to study the language of mathematics in different contexts, ranging from fundamental concepts found in textbooks to advanced research mathematics. We preprocess the corpora with a neural parsing model and some manual intervention to provide part-of-speech tags, lemmas, and dependency trees. In total, we provide 182397 sentences across three corpora. We then aim to test and evaluate several noteworthy natural language processing models using these corpora, to show how well they can adapt to the domain of mathematics and provide useful tools for exploring mathematical…
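As a rough illustration of the annotation layers the abstract mentions (part-of-speech tags, lemmas, and dependency trees), the sketch below reads one annotated sentence and pulls out its noun lemmas and dependency root. It assumes a CoNLL-U-style tab-separated layout, which the abstract does not confirm. The sample sentence, its annotations, and the names `Token`, `parse_sentence`, and `noun_lemmas` are hypothetical and not taken from the paper's corpora.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    idx: int      # 1-based position in the sentence
    form: str     # surface word
    lemma: str    # dictionary form
    upos: str     # universal part-of-speech tag
    head: int     # index of the syntactic head (0 = root)
    deprel: str   # dependency relation to the head


# Hypothetical hand-written sample, assuming the ten CoNLL-U columns:
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
SAMPLE = "\n".join([
    "# text = A group is a set.",
    "1\tA\ta\tDET\t_\t_\t2\tdet\t_\t_",
    "2\tgroup\tgroup\tNOUN\t_\t_\t5\tnsubj\t_\t_",
    "3\tis\tbe\tAUX\t_\t_\t5\tcop\t_\t_",
    "4\ta\ta\tDET\t_\t_\t5\tdet\t_\t_",
    "5\tset\tset\tNOUN\t_\t_\t0\troot\t_\t_",
    "6\t.\t.\tPUNCT\t_\t_\t5\tpunct\t_\t_",
])


def parse_sentence(block: str) -> List[Token]:
    """Parse one sentence block, skipping comments and non-token lines."""
    tokens = []
    for line in block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword ranges ("1-2") and empty nodes ("1.1").
        if not cols[0].isdigit():
            continue
        tokens.append(
            Token(int(cols[0]), cols[1], cols[2], cols[3], int(cols[6]), cols[7])
        )
    return tokens


def noun_lemmas(tokens: List[Token]) -> List[str]:
    """Return lemmas of tokens tagged NOUN, in sentence order."""
    return [t.lemma for t in tokens if t.upos == "NOUN"]


tokens = parse_sentence(SAMPLE)
root = next(t for t in tokens if t.head == 0)
print(noun_lemmas(tokens), root.form)
```

Noun lemmas are only a crude starting point for candidate terms; the findings listed above indicate that terminology extraction in mathematics needs more than this kind of surface filter.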
Taxonomy
Topics: Mathematics, Computing, and Information Processing
Methods: Sparse Evolutionary Training
