MathBridge: A Large Corpus Dataset for Translating Spoken Mathematical Expressions into LaTeX Formulas for Improved Readability
Kyudan Jung, Sieun Hyeon, Jeong Youn Kwon, Nam-Joon Kim, Hyun Gon Ryu, Hyuk-Jae Lee, Jaeyoung Do

TL;DR
MathBridge is a large dataset of 23 million paired spoken mathematical sentences and LaTeX formulas, enabling improved training of models to convert spoken math into readable formulas, significantly boosting performance.
Contribution
The paper introduces MathBridge, the first extensive dataset for translating spoken math sentences into LaTeX formulas, facilitating better model training and conversion accuracy.
Findings
Fine-tuning with MathBridge greatly improves LaTeX conversion scores.
T5-large model's sacreBLEU score increased from 4.77 to 46.8.
Dataset enables substantial enhancement of pretrained language models.
Abstract
Improving the readability of mathematical expressions in text-based documents, such as subtitles of mathematical videos, is a significant task. To achieve this, mathematical expressions should be converted to compiled formulas. For instance, the spoken expression "x equals minus b plus or minus the square root of b squared minus four a c, all over two a" from automatic speech recognition is more readily comprehensible when displayed as a compiled formula $x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$. To convert mathematical spoken sentences to compiled formulas, two processes are required: spoken sentences are converted into LaTeX formulas, and LaTeX formulas are converted into compiled formulas. The latter can be managed by using LaTeX engines. However, there is no way to do the former effectively. Even if we try to solve this using language models, there is no paired data between spoken…
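The two-stage process described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `spoken_to_latex` and `latex_to_document` and the one-entry lookup table are hypothetical, and the lookup table merely stands in for a language model fine-tuned on spoken/LaTeX pairs. The second stage only prepares a document that a LaTeX engine could compile; it does not invoke one.

```python
# Hypothetical sketch of the two-stage pipeline; not the authors' implementation.

# Stage 1 stand-in: a real system would use a language model fine-tuned on
# spoken/LaTeX pairs. A one-entry lookup table is used here for illustration only.
SPOKEN_TO_LATEX = {
    "x equals minus b plus or minus the square root of b squared "
    "minus four a c, all over two a":
        r"x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}",
}


def spoken_to_latex(spoken: str) -> str:
    """Stage 1 (placeholder): map a spoken sentence to a LaTeX formula."""
    # Normalize case and whitespace so ASR-style variations still match the key.
    key = " ".join(spoken.lower().split())
    return SPOKEN_TO_LATEX[key]


def latex_to_document(formula: str) -> str:
    """Stage 2 input: wrap a formula in a minimal document for a LaTeX engine."""
    return "\n".join([
        r"\documentclass{article}",
        r"\begin{document}",
        r"\[ " + formula + r" \]",
        r"\end{document}",
    ])


if __name__ == "__main__":
    spoken = ("x equals minus b plus or minus the square root of b squared "
              "minus four a c, all over two a")
    print(latex_to_document(spoken_to_latex(spoken)))
```

Saving the printed output to a `.tex` file and running a LaTeX engine on it would produce the compiled formula; the difficult part, as the abstract notes, is the first stage.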
Taxonomy
Topics: Text Readability and Simplification · Mathematics, Computing, and Information Processing · Natural Language Processing Techniques
