What Makes a Good Dataset for Symbol Description Reading?

Karol Lynch; Joern Ploennigs; Bradley Eck

arXiv:2304.08352 · cs.CL · April 18, 2023 · 1 citation

TL;DR

This paper introduces a new dataset and methods for interpreting mathematical formulas by identifying symbols and their descriptions, advancing document understanding in mathematical contexts.

Contribution

The paper presents the MFQuAD dataset, novel variations of the noun phrase ranking approach, and a position on the features that make an effective dataset for the mathematical identifier description reading (MIDR) task.

Findings

01 MFQuAD dataset with 7508 annotated occurrences

02 State-of-the-art noun phrase ranking results

03 Insights into dataset features for MIDR

Abstract

The use of mathematical formulas as concise representations of a document's key ideas is common practice. Correctly interpreting these formulas, by identifying mathematical symbols and extracting their descriptions, is an important task in document understanding. This paper makes the following contributions to the mathematical identifier description reading (MIDR) task: (i) it introduces the Math Formula Question Answering Dataset (MFQuAD) with $7508$ annotated identifier occurrences; (ii) it describes novel variations of the noun phrase ranking approach for the MIDR task; (iii) it reports experimental results for the state-of-the-art (SOTA) noun phrase ranking approach and the novel variations of that approach, providing problem insights and a performance baseline; (iv) it provides a position on the features that make an effective dataset for the MIDR task.

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Natural Language Processing Techniques · Handwritten Text Recognition Techniques