What Makes a Good Dataset for Symbol Description Reading?
Karol Lynch, Joern Ploennigs, Bradley Eck

TL;DR
This paper introduces a new dataset and methods for interpreting mathematical formulas by identifying symbols and their descriptions, advancing document understanding in mathematical contexts.
Contribution
It presents the MFQuAD dataset, novel variations of the noun phrase ranking approach, and insights into the features that make an effective dataset for the mathematical identifier description reading (MIDR) task.
Findings
MFQuAD dataset with 7508 annotated identifier occurrences
State-of-the-art noun phrase ranking results
Insights into dataset features for MIDR
Abstract
The usage of mathematical formulas as concise representations of a document's key ideas is common practice. Correctly interpreting these formulas, by identifying mathematical symbols and extracting their descriptions, is an important task in document understanding. This paper makes the following contributions to the mathematical identifier description reading (MIDR) task: (i) introduces the Math Formula Question Answering Dataset (MFQuAD) with annotated identifier occurrences; (ii) describes novel variations of the noun phrase ranking approach for the MIDR task; (iii) reports experimental results for the SOTA noun phrase ranking approach and our novel variations of the approach, providing problem insights and a performance baseline; (iv) provides a position on the features that make an effective dataset for the MIDR task.
Taxonomy
Topics: Mathematics, Computing, and Information Processing · Natural Language Processing Techniques · Handwritten Text Recognition Techniques
