Speech-to-LaTeX: New Models and Datasets for Converting Spoken Equations and Sentences
Dmitrii Korzh, Dmitrii Tarasov, Artyom Iudin, Elvir Karimov, Matvey Skripkin, Nikita Kuzmin, Andrey Kuznetsov, Oleg Y. Rogov, Ivan Oseledets

TL;DR
This paper introduces new models and a large open-source dataset for converting spoken mathematical expressions into LaTeX, improving accuracy and enabling multilingual and sentence-level recognition.
Contribution
It provides the first large-scale, multilingual dataset and benchmarks for speech-to-LaTeX conversion, along with novel models demonstrating significant performance improvements.
Findings
Models outperform previous approaches on the S2L-equations benchmark.
Achieved roughly 40% equation CER (39.7% with SALMONN) on mathematical sentence recognition.
Demonstrated comparable results with audio language models on MathSpeech.
Abstract
Conversion of spoken mathematical expressions is a challenging task that involves transcribing speech into a strictly structured symbolic representation while addressing the ambiguity inherent in the pronunciation of equations. Although significant progress has been achieved in automatic speech recognition (ASR) and language models (LM), the problem of converting spoken mathematics into LaTeX remains underexplored. This task directly applies to educational and research domains, such as lecture transcription or note creation. Based on ASR post-correction, prior work requires 2 transcriptions, focuses only on isolated equations, has a limited test set, and provides neither training data nor multilingual coverage. To address these issues, we present the first fully open-source large-scale dataset, comprising over 66,000 human-annotated audio samples of mathematical equations and sentences…
Peer Reviews
Decision: ICLR 2026 Poster
1) This paper targets Speech-to-LaTeX, an underexplored but impactful task for education and research, and contributes a large-scale S2L dataset.
2) It shows strong empirical results on S2L-equations (English): SALMONN achieves 17.5% CER, outperforming MathSpeech and Qwen; on S2L-sentences, SALMONN attains the best equation CER (39.7%).
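For reference, character error rate (CER) is conventionally the Levenshtein edit distance between the predicted and reference strings, divided by the reference length. The sketch below assumes that standard definition applied to raw LaTeX strings. The function names are illustrative, and the paper's own evaluation may normalize the LaTeX before scoring, so this is not the authors' script.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions
    needed to turn `ref` into `hyp` (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into the empty string costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference string must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


# Two readings of the same spoken phrase differ by 2 edits
# over a 13-character reference.
print(cer(r"\frac{1}{x}+2", r"\frac{1}{x+2}"))
```

Because CER is computed on characters, a small structural mistake (a misplaced brace) costs little even though it changes the meaning of the formula, which is one reason the metric should be read alongside qualitative error analysis.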
1) Gap to real lecture conditions. The authors note that the dataset does not capture paraphrases, incomplete expressions, or audio-visual coupling typical of classroom settings.
2) It is difficult to verify the reliability of the dataset presented in the paper. Although some sample data are available in the supplementary material, those samples are insufficient to establish whether the constructed dataset adequately covers diverse scenarios of spoken mathematical expressions.
3) Reproducibility risk f
- **Large, open S2L resource:** Releases a two-part dataset (S2L-equations, S2L-sentences) with multilingual coverage (English/Russian), mixing 66k human and 571k synthetic clips, collected from diverse sources and 33 annotators. This addresses the data bottleneck and standardizes evaluation.
- **Clear task framing & thorough splits:** Uses disjoint-formula splits, human vs. TTS source splits, and mono vs. bilingual training setups, plus KaTeX-based equation normalization, to probe generalization
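A disjoint-formula split, as mentioned in this review, can be sketched as follows: clips are grouped by formula, and whole groups are assigned to either train or test so that no formula appears in both. The function name, the `(audio_id, latex)` sample layout, and grouping by the raw LaTeX string are all assumptions for illustration; the paper's actual procedure (for example, grouping after KaTeX-based normalization) is not shown on this page.

```python
import random
from collections import defaultdict


def split_by_formula(samples, test_fraction=0.2, seed=0):
    """Split (audio_id, latex) pairs so train and test share no formula.

    Hypothetical helper: the grouping key is the raw LaTeX string.
    """
    groups = defaultdict(list)
    for audio_id, latex in samples:
        groups[latex].append(audio_id)

    formulas = sorted(groups)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(formulas)
    n_test = max(1, round(len(formulas) * test_fraction))

    test = [(a, f) for f in formulas[:n_test] for a in groups[f]]
    train = [(a, f) for f in formulas[n_test:] for a in groups[f]]
    return train, test


# Several clips may pronounce the same formula; they always stay together.
samples = [
    ("clip1", "x^2"), ("clip2", "x^2"),
    ("clip3", r"\frac{1}{x}"),
    ("clip4", "a+b"), ("clip5", "a+b"),
    ("clip6", r"\sqrt{y}"),
    ("clip7", r"e^{i\pi}"),
]
train, test = split_by_formula(samples)
print(len(train), len(test))
```

Splitting at the formula level rather than the clip level means test-set scores reflect generalization to unseen equations, not memorization of formulas heard from a different speaker.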
- **Compute/latency not discussed:** The best model (SALMONN-13B) is likely heavy; no throughput/latency/memory reporting limits practical takeaways for real-time use.
- **Ambiguity handling is under-analyzed:** The work acknowledges inherent ambiguity (“one over x plus two”) but gives limited breakdowns by ambiguity type or guidance on disambiguating conventions during annotation/evaluation.
- **Model behavior anomalies lack ablations:** 7B (LoRA-tuned, frozen base) underperforms fully fine-tun
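To make the ambiguity example concrete, the spoken phrase “one over x plus two” admits at least two valid LaTeX readings, depending on where the denominator is taken to end. The snippet below is an illustration only (it assumes `amsmath` for `\text`) and is not drawn from the dataset's annotation guidelines.

```latex
\[
\text{``one over x plus two''}
\;\longrightarrow\;
\frac{1}{x} + 2
\quad\text{or}\quad
\frac{1}{x + 2}
\]
```

Without a pause, prosodic cue, or an agreed annotation convention, both outputs are defensible transcriptions, so a single-reference metric will penalize one of them.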
1. The benchmark dataset proposed in this paper covers a wider range of diverse, real-world scenarios, making it a valuable contribution to the research community.
2. The paper provides a comprehensive exploration of the dataset’s potential applications, including post-transcription correction and end-to-end multimodal fine-tuning.
3. The dataset-splitting strategy is well designed, and the experimental results effectively demonstrate the generalization ability of different models when adapting
1. The technical novelty of the paper is limited. The work mainly reimplements common approaches to fine-tuning models without proposing new model architectures or training algorithms to address the issues identified in the experiments.
2. The Character Error Rate (CER) remains relatively high in many cases, and the paper lacks an in-depth analysis of the causes behind these failures. A more thorough investigation into the unexpected phenomena observed in the results would strengthen the contrib
