Cross-Platform Evaluation of Reasoning Capabilities in Foundation Models
J. de Curtò, I. de Zarzà, Pablo García, Jordi Cabot

TL;DR
This study evaluates the reasoning abilities of 15 foundation models across multiple computing platforms and academic domains. It finds that training data quality matters more than model size, and it provides a scalable benchmark for future model assessment.
Contribution
The paper introduces a comprehensive, infrastructure-agnostic benchmark of 79 problems for evaluating reasoning in foundation models across diverse computational environments.
Findings
Training data quality outweighs model size in reasoning performance.
Benchmark results challenge conventional scaling assumptions.
Methodology enables tracking of reasoning capabilities over time.
Abstract
This paper presents a comprehensive cross-platform evaluation of reasoning capabilities in contemporary foundation models, establishing an infrastructure-agnostic benchmark across three computational paradigms: HPC supercomputing (MareNostrum 5), cloud platforms (Nebius AI Studio), and university clusters (a node with eight H200 GPUs). We evaluate 15 foundation models across 79 problems spanning eight academic domains (Physics, Mathematics, Chemistry, Economics, Biology, Statistics, Calculus, and Optimization) through three experimental phases: (1) Baseline establishment: Six models (Mixtral-8x7B, Phi-3, LLaMA 3.1-8B, Gemma-2-9b, Mistral-7B, OLMo-7B) evaluated on 19 problems using MareNostrum 5, establishing methodology and reference performance; (2) Infrastructure validation: The 19-problem benchmark repeated on a university cluster (seven models including Falcon-Mamba state-space…
