Towards Scalable Training for Handwritten Mathematical Expression Recognition
Haoyang Li, Jiaqing Li, Jialun Cao, Zongyuan Yang, Yongping Xiong

TL;DR
This paper introduces TexTeller, a scalable approach to handwritten mathematical expression recognition that leverages a massive LaTeX-based dataset and mix-training, achieving state-of-the-art results.
Contribution
The paper develops a scalable data engine that generates a large dataset of LaTeX-rendered formulas, and uses it to train TexTeller, which the authors describe as the first HMER model trained at scale.
Findings
Built the largest formula dataset, Tex80M, with over 80 million instances.
Achieved state-of-the-art performance across multiple benchmarks.
Enabled open release of dataset, model, and code for future research.
Abstract
Large foundation models have achieved significant performance gains through scalable training on massive datasets. However, the field of Handwritten Mathematical Expression Recognition (HMER) has been impeded by the scarcity of data, primarily due to the arduous and costly process of manual annotation. To bridge this gap, we propose a novel method integrating limited handwritten formulas with large-scale LaTeX-rendered formulas by developing a scalable data engine to generate complex and consistent LaTeX sequences. With this engine, we built the largest formula dataset to date, termed Tex80M, comprising over 80 million high-quality training instances. Then we propose TexTeller, the first HMER model trained at scale, by mix-training Tex80M with a relatively small HME dataset. The expansive training dataset and our refined…
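The abstract describes mix-training a large rendered corpus with a much smaller handwritten set but, as excerpted here, does not give the recipe. The following is only a minimal sketch of one common way to mix two sources of very different size: each batch takes a fixed share from the small set, sampled with replacement. The function name `mixed_batches`, the `handwritten_fraction` parameter, and the fixed-ratio scheme are all assumptions for illustration, not the authors' method.

```python
import random
from typing import Iterator, List, Sequence, Tuple

# Hypothetical sample type: (image identifier, LaTeX target string).
Sample = Tuple[str, str]


def mixed_batches(
    rendered: Sequence[Sample],
    handwritten: Sequence[Sample],
    batch_size: int,
    handwritten_fraction: float,
    num_batches: int,
    seed: int = 0,
) -> Iterator[List[Sample]]:
    """Yield batches that combine a large and a small dataset at a fixed ratio.

    Assumption: the ratio is fixed per batch and both sources are sampled
    with replacement, so the small handwritten set is revisited many times
    while the large rendered set is covered gradually.
    """
    if not 0.0 <= handwritten_fraction <= 1.0:
        raise ValueError("handwritten_fraction must be in [0, 1]")
    rng = random.Random(seed)
    n_handwritten = round(batch_size * handwritten_fraction)
    n_rendered = batch_size - n_handwritten
    for _ in range(num_batches):
        batch = rng.choices(handwritten, k=n_handwritten)
        batch += rng.choices(rendered, k=n_rendered)
        rng.shuffle(batch)  # avoid a fixed source order inside the batch
        yield batch


if __name__ == "__main__":
    # Toy stand-ins for the two sources; real data would be image/LaTeX pairs.
    rendered = [(f"rendered_{i}.png", r"\frac{a}{b}") for i in range(1000)]
    handwritten = [(f"handwritten_{i}.png", r"x^{2}") for i in range(10)]
    for batch in mixed_batches(rendered, handwritten, 8, 0.25, num_batches=2):
        print([name for name, _ in batch])
```

Sampling with replacement keeps the share of handwritten examples constant regardless of how lopsided the two dataset sizes are; whether a constant share is the right choice is a tuning question this sketch does not address.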
Taxonomy
Topics: Handwritten Text Recognition Techniques · Mathematics, Computing, and Information Processing · Topic Modeling
