TL;DR
Datarus-R1-14B is an adaptive multi-step reasoning language model fine-tuned for automated data analysis. It solves complex problems by interleaving reasoning, code execution, and reflection, and outperforms similar-size models on benchmarks.
Contribution
Introduces a dual reasoning interface with agentic and reflection modes, a novel training pipeline with synthetic data and hierarchical rewards, and achieves state-of-the-art performance on reasoning benchmarks.
Findings
Surpasses similar-size models on reasoning benchmarks
Achieves up to 30% higher accuracy on AIME and LiveCodeBench
Emits 18–49% fewer tokens per solution
Abstract
We present Datarus-R1-14B, a 14B-parameter open-weights language model fine-tuned from Qwen2.5-14B-Instruct to act as a virtual data analyst and graduate-level problem solver. Datarus is trained not on isolated question-answer pairs but on full analytical trajectories, including reasoning steps, code execution, error traces, self-corrections, and final conclusions, all captured in a ReAct-style notebook format spanning finance, medicine, numerical analysis, and other quantitative domains. Our training pipeline combines (i) a trajectory-centric synthetic data generator that yielded 144,000 tagged notebook episodes, (ii) a dual-reward framework blending a lightweight tag-based structural signal with a Hierarchical Reward Model (HRM) that scores both single-step soundness and end-to-end coherence, and (iii) a memory-optimized implementation of Group Relative Policy Optimization (GRPO)…
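The abstract does not say how the tag-based structural signal and the HRM scores are combined, nor which tags are used. The following minimal Python sketch illustrates the idea under stated assumptions. It assumes a linear blend with hypothetical weights (`w_tag`, `w_step`, `w_traj`) and hypothetical tag names (`think`, `answer`). It then feeds the blended rewards into the standard GRPO group-relative advantage, where each sampled completion's reward is centered on its group's mean and divided by the group's standard deviation. This is not the authors' implementation and omits the memory optimizations the abstract mentions.

```python
import re
from statistics import mean, pstdev

# HYPOTHETICAL tag names: the abstract says episodes are "tagged" in a
# ReAct-style notebook format but does not list the actual tags.
REQUIRED_TAGS = ("think", "answer")


def tag_reward(text: str) -> float:
    """Structural signal: fraction of required tags present as <tag>...</tag> pairs."""
    hits = sum(
        1
        for tag in REQUIRED_TAGS
        if re.search(rf"<{tag}>.*?</{tag}>", text, flags=re.DOTALL)
    )
    return hits / len(REQUIRED_TAGS)


def blended_reward(
    text: str,
    hrm_step: float,
    hrm_trajectory: float,
    w_tag: float = 0.2,   # HYPOTHETICAL weight
    w_step: float = 0.4,  # HYPOTHETICAL weight
    w_traj: float = 0.4,  # HYPOTHETICAL weight
) -> float:
    """Assumed linear blend of the structural signal with step-level and
    trajectory-level HRM scores (each assumed to lie in [0, 1])."""
    return w_tag * tag_reward(text) + w_step * hrm_step + w_traj * hrm_trajectory


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantage: (reward - group mean) / (group std + eps).
    Uses the population standard deviation; eps avoids division by zero
    when every completion in the group received the same reward."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled completions for one prompt, with made-up HRM scores.
group = [
    ("<think>plan</think><answer>42</answer>", 0.9, 0.8),
    ("<think>plan</think>", 0.6, 0.3),
    ("<answer>41</answer>", 0.2, 0.1),
    ("no tags at all", 0.1, 0.0),
]
rewards = [blended_reward(text, step, traj) for text, step, traj in group]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because advantages are computed relative to the other completions sampled for the same prompt, GRPO needs no separate value network. Completions scoring above the group mean receive positive advantages, and those below receive negative ones.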
