RoMathExam: A Longitudinal Dataset of Romanian Math Exams (1895-2025) with a Seven-Decade Core (1957-2025)
Luca-Nicolae Cuclea, Sabin-Codrut Badea, Adrian-Marius Dumitran

TL;DR
RoMathExam is a longitudinal dataset of 10,592 Romanian high-school mathematics exam problems from 1895 to 2025, enabling research in difficulty modeling, curriculum analysis, and language-specific AI evaluation.
Contribution
It introduces a large, structured Romanian math exam dataset with curriculum tags, embeddings, and a novel complexity metric validated across multiple AI models.
Findings
The complexity metric correlates highly across models (r > 0.72).
Exam content shifts from diverse historical formats to a standardized modern curriculum.
The dataset proves useful for longitudinal curriculum and difficulty analysis.
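To make the cross-model agreement figure concrete, here is a minimal sketch of how a correlation between two models' complexity scores could be computed. It assumes Pearson's r, which this summary does not confirm; the function name `pearson_r` and the score lists are hypothetical, made-up values for illustration, not data from RoMathExam.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up complexity scores for the same five problems,
# as rated by two hypothetical models (illustrative only).
scores_model_a = [1.0, 2.0, 3.0, 4.0, 5.0]
scores_model_b = [1.5, 1.9, 3.2, 3.8, 5.1]

r = pearson_r(scores_model_a, scores_model_b)
print(f"r = {r:.3f}")
```

A rank-based coefficient such as Spearman's would be a reasonable alternative if the scores are ordinal; the choice here is only an assumption.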
Abstract
AI in Education research increasingly relies on authentic, curriculum-grounded assessment data, yet large, well-structured exam corpora remain scarce for many languages and educational systems. We introduce RoMathExam, a longitudinal dataset of Romanian high-school mathematics exams spanning 1895-2025, with a robust standardized core for 1957-2025. The dataset contains 10,592 mathematics problems organized into 600+ complete exam sets across multiple tracks (M1-M4), covering both official national examination sessions and ministry-published training variants. Beyond high-fidelity digitization and a unified JSON schema with traceable provenance, RoMathExam is enriched with curriculum-aligned topic tags and dense text embeddings, enabling variant detection, deduplication, and similarity-based retrieval. To overcome the lack of historical psychometric data, we propose and validate a…
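As an illustration of how dense embeddings can support variant detection and deduplication, the following is a minimal sketch that flags problem pairs whose embeddings have high cosine similarity. The identifiers, the three-dimensional toy vectors, and the 0.95 threshold are all assumptions made for this example; the dataset's actual embedding model, schema, and thresholds are not described in this summary.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def near_duplicates(embeddings, threshold=0.95):
    """Return ID pairs whose embeddings meet the (assumed) similarity threshold."""
    pairs = []
    # Brute-force comparison of every pair; fine for a small illustration.
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        if cosine_similarity(vec_a, vec_b) >= threshold:
            pairs.append((id_a, id_b))
    return pairs


# Toy embeddings keyed by hypothetical problem IDs (not real dataset entries).
toy_embeddings = {
    "p1": [1.0, 0.0, 0.0],
    "p2": [0.99, 0.05, 0.0],  # nearly parallel to p1: a likely variant
    "p3": [0.0, 1.0, 0.0],
}

duplicate_pairs = near_duplicates(toy_embeddings)
print(duplicate_pairs)
```

The pairwise loop is quadratic in the number of problems; at the scale of roughly ten thousand problems, a vectorized or approximate nearest-neighbor search would likely be preferable.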
