Disambiguating Symbolic Expressions in Informal Documents
Dennis Müller, Cezary Kaliszyk

TL;DR
This paper introduces a new task of disambiguating symbolic expressions in informal LaTeX documents, presenting a dataset and a transformer-based approach that shows promising results despite limited data.
Contribution
The paper formulates the disambiguation of symbolic expressions as a neural translation task and provides a novel dataset along with a transformer-based methodology for this challenge.
Findings
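Baseline models failed to produce even syntactically valid LaTeX before overfitting.
A transformer language model pre-trained on sources from arxiv.org yielded promising disambiguation results despite the small dataset.
The model was evaluated with several dedicated techniques that take both the syntax and the semantics of symbolic expressions into account.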
Abstract
We propose the task of disambiguating symbolic expressions in informal STEM documents in the form of LaTeX files - that is, determining their precise semantics and abstract syntax tree - as a neural machine translation task. We discuss the distinct challenges involved and present a dataset with roughly 33,000 entries. We evaluated several baseline models on this dataset, which failed to yield even syntactically valid LaTeX before overfitting. Consequently, we describe a methodology using a transformer language model pre-trained on sources obtained from arxiv.org, which yields promising results despite the small size of the dataset. We evaluate our model using a plurality of dedicated techniques, taking the syntax and semantics of symbolic expressions into account.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Mathematics, Computing, and Information Processing
