Benchmarking Large Language Models with Integer Sequence Generation Tasks
Daniel O'Malley, Manish Bhattarai, Nishath Rajiv Ranasinghe, Erick Draayer, Javier Santos

TL;DR
This paper introduces a benchmark built from OEIS integer sequences to evaluate large language models' mathematical reasoning and code synthesis abilities. Reasoning-specialized models do best, but performance on hard sequences remains poor overall.
Contribution
The paper presents a novel benchmark with a cheating detection mechanism to assess LLMs on integer sequence generation, highlighting the performance gap in complex reasoning tasks.
Findings
Reasoning-specialized models outperform others on complex sequences.
Overall performance on hard sequences remains poor.
The cheating detection mechanism effectively prevents models from exploiting memorized sequence values.
Abstract
We present a novel benchmark designed to rigorously evaluate the capabilities of large language models (LLMs) in mathematical reasoning and algorithmic code synthesis tasks. The benchmark comprises integer sequence generation tasks sourced from the On-Line Encyclopedia of Integer Sequences (OEIS), testing LLMs' abilities to accurately and efficiently generate Python code to compute these sequences without using lookup tables. Our comprehensive evaluation includes leading models from OpenAI (including the specialized reasoning-focused o-series), Anthropic, Meta, and Google across a carefully selected set of 1000 OEIS sequences categorized as "easy" or "hard." Half of these sequences are classical sequences from the early days of OEIS and half were recently added to avoid contamination with the models' training data. To prevent models from exploiting memorized sequence values, we…
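To make the task concrete, the following is a minimal sketch of the distinction the abstract draws between computing a sequence and looking it up. The choice of the Fibonacci numbers (OEIS A000045) and the names `fib_computed`, `fib_lookup`, and `_MEMORIZED` are illustrative assumptions, not taken from the paper. The paper's actual prompts and its detection mechanism are not reproduced here.

```python
def fib_computed(n: int) -> int:
    """Return the n-th Fibonacci number (A000045, with F(0) = 0) by iteration.

    This is the kind of algorithmic solution the benchmark asks for:
    it works for any non-negative n, not only for memorized terms.
    """
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


# Hypothetical example of a lookup-table "solution": it only reproduces
# stored values and fails as soon as n goes past the end of the table.
_MEMORIZED = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]


def fib_lookup(n: int) -> int:
    """Return a stored term; raises IndexError for n >= len(_MEMORIZED)."""
    return _MEMORIZED[n]


first_ten = [fib_computed(n) for n in range(10)]
```

Both functions agree on the first ten terms, but only `fib_computed` generalizes beyond the stored values, which is why a benchmark of this kind needs some way to tell the two apart.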
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
