EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages
Aman Sharma, Paras Chopra

TL;DR
EsoLang-Bench introduces a benchmark using esoteric programming languages to evaluate large language models' ability to generalize reasoning skills to out-of-distribution languages, revealing significant performance gaps.
Contribution
The paper presents a novel benchmark with five esoteric languages to assess LLMs' out-of-distribution reasoning capabilities in programming.
Findings
Frontier models achieve 100% accuracy on Python/JavaScript problems.
The esoteric-language versions of those problems score only 0-11% accuracy.
Few-shot learning and self-reflection do not significantly improve performance.
Abstract
Large language models achieve near-ceiling performance on code generation benchmarks, yet most of the programming languages used by popular benchmarks such as SWE-bench and HumanEval (e.g. Python, JavaScript) are squarely in-distribution. They appear at scale in pre-training corpora and are heavily reinforced during post-training. To study LLM performance on unfamiliar programming languages, we introduce EsoLang-Bench, a benchmark using five esoteric programming languages (Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare). All five of our chosen esoteric languages are Turing-complete, so the same algorithmic problems that are solvable in Python or JavaScript are in principle solvable in each of them. Yet they are unfamiliar to LLMs, which makes them a good proxy for evaluating out-of-distribution performance. The unfamiliarity of esoteric languages comprises: (i) the…
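To make the setting concrete, here is a minimal sketch of how a model-written Brainfuck program could be executed and checked against an expected output. It is not the authors' evaluation harness. The function name `run_brainfuck`, the 30,000-cell tape, the 8-bit wrapping cells, the end-of-input value of 0, and the step limit are all assumptions chosen for illustration.

```python
def run_brainfuck(program: str, stdin: str = "", max_steps: int = 1_000_000) -> str:
    """Run a Brainfuck program and return everything it prints (illustrative sketch)."""
    # Pre-compute matching bracket positions so loops can jump in O(1).
    stack, jump = [], {}
    for i, ch in enumerate(program):
        if ch == "[":
            stack.append(i)
        elif ch == "]":
            if not stack:
                raise ValueError("unmatched ']'")
            j = stack.pop()
            jump[i], jump[j] = j, i
    if stack:
        raise ValueError("unmatched '['")

    tape = [0] * 30_000          # assumed tape size
    ptr = pc = steps = 0
    inp = iter(stdin)
    out = []

    while pc < len(program):
        steps += 1
        if steps > max_steps:    # guard against non-terminating programs
            raise RuntimeError("step limit exceeded")
        ch = program[pc]
        if ch == ">":
            ptr += 1
        elif ch == "<":
            ptr -= 1
        elif ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256   # assumed 8-bit wrapping cells
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ".":
            out.append(chr(tape[ptr]))
        elif ch == ",":
            # Assumed convention: reading past the end of input stores 0.
            tape[ptr] = ord(next(inp, "\0")) % 256
        elif ch == "[" and tape[ptr] == 0:
            pc = jump[pc]        # skip the loop body
        elif ch == "]" and tape[ptr] != 0:
            pc = jump[pc]        # repeat the loop body
        if not 0 <= ptr < len(tape):
            raise IndexError("data pointer moved off the tape")
        pc += 1                  # all other characters are comments

    return "".join(out)


# The Python one-liner print("A") becomes a multiply-and-add loop in Brainfuck.
print(run_brainfuck("++++++++[>++++++++<-]>+."))   # 8 * 8 + 1 = 65, i.e. "A"
```

Even this trivial task shows the gap in surface form: the Python solution is a single call, while the Brainfuck one has to build the character code arithmetically, cell by cell.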
