EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages

Aman Sharma; Paras Chopra

arXiv:2603.09678·cs.AI·May 13, 2026

TL;DR

EsoLang-Bench introduces a benchmark using esoteric programming languages to evaluate large language models' ability to generalize reasoning skills to out-of-distribution languages, revealing significant performance gaps.

Contribution

The paper presents a novel benchmark with five esoteric languages to assess LLMs' out-of-distribution reasoning capabilities in programming.

Findings

01. Frontier models achieve 100% accuracy on Python/JavaScript problems.

02. Esoteric language versions score only 0–11% accuracy.

03. Few-shot learning and self-reflection do not significantly improve performance.

Abstract

Large language models achieve near-ceiling performance on code generation benchmarks, yet most of the programming languages used by popular benchmarks such as SWE-bench and HumanEval (e.g. Python, JavaScript) are squarely in-distribution: they appear at scale in pre-training corpora and are heavily reinforced during post-training. To study LLM performance on unfamiliar programming languages, we introduce EsoLang-Bench, a benchmark using five esoteric programming languages (Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare). All five of our chosen esoteric languages are Turing-complete, so the same algorithmic problems that are solvable in Python or JavaScript are in principle solvable in each of them. Yet they are unfamiliar to LLMs, which makes them a good proxy for evaluating out-of-distribution performance. The unfamiliarity of esoteric languages consists of: (i) the…
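To make the contrast between a mainstream and an esoteric language concrete, here is a minimal sketch in Python of a Brainfuck interpreter, together with one trivial problem (adding two small numbers) written both in Python and in Brainfuck. This is an illustration only and is not the paper's evaluation harness or one of its benchmark problems: the function names `run_brainfuck` and `add_python`, the 30,000-cell tape, the 8-bit wrapping cells, and the "EOF reads as 0" convention are all assumptions chosen for the example.

```python
def run_brainfuck(program: str, input_bytes: bytes = b"", tape_len: int = 30000) -> bytes:
    """Run a Brainfuck program and return the bytes it outputs.

    Assumptions (common conventions, not taken from the paper):
    8-bit wrapping cells, a fixed-size tape, and 0 stored on end of input.
    """
    # Pre-compute the position of each bracket's partner.
    jumps = {}
    stack = []
    for pos, ch in enumerate(program):
        if ch == "[":
            stack.append(pos)
        elif ch == "]":
            if not stack:
                raise ValueError("unmatched ']'")
            start = stack.pop()
            jumps[start] = pos
            jumps[pos] = start
    if stack:
        raise ValueError("unmatched '['")

    tape = [0] * tape_len
    ptr = 0       # data pointer
    pc = 0        # program counter
    in_pos = 0    # next unread input byte
    out = bytearray()

    while pc < len(program):
        ch = program[pc]
        if ch == ">":
            ptr += 1
        elif ch == "<":
            ptr -= 1
        elif ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ".":
            out.append(tape[ptr])
        elif ch == ",":
            if in_pos < len(input_bytes):
                tape[ptr] = input_bytes[in_pos]
                in_pos += 1
            else:
                tape[ptr] = 0  # assumed EOF convention
        elif ch == "[" and tape[ptr] == 0:
            pc = jumps[pc]     # skip the loop body
        elif ch == "]" and tape[ptr] != 0:
            pc = jumps[pc]     # jump back to the loop start
        pc += 1                # any other character is a comment
    return bytes(out)


def add_python(a: int, b: int) -> int:
    """The same toy problem in a mainstream language."""
    return a + b


# Brainfuck version: read two bytes, move the first into the second, print the sum.
ADD_BF = ",>,<[->+<]>."

if __name__ == "__main__":
    print(add_python(2, 3))                         # integer sum
    print(run_brainfuck(ADD_BF, bytes([2, 3]))[0])  # same sum, as a raw byte value
```

Both versions compute the same result, but the Brainfuck program has to express addition as a loop that moves one tape cell into another, one decrement at a time. The algorithm is unchanged; only the surface language differs.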
