LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages

Andrew M. Bean; Simi Hellsten; Harry Mayne; Jabez Magomere; Ethan A. Chi; Ryan Chi; Scott A. Hale; Hannah Rose Kirk

arXiv:2406.06196 · cs.CL · November 1, 2024 · 1 citation

TL;DR

The LingOly benchmark evaluates large language models on Linguistic Olympiad puzzles covering more than 90 mostly low-resource or extinct languages, revealing current models' limitations in multi-step linguistic reasoning and generalization.

Contribution

This paper introduces LingOly, a comprehensive benchmark for assessing advanced linguistic reasoning in low-resource and extinct languages using Olympiad puzzles.

Findings

01 Models perform poorly on the higher-difficulty problems.
02 On harder problems, even the top model achieves only 38.7% accuracy.
03 Higher-resource languages yield better model performance.

Abstract

In this paper, we present the LingOly benchmark, a novel benchmark for advanced reasoning abilities in large language models. Using challenging Linguistic Olympiad puzzles, we evaluate (i) capabilities for in-context identification and generalisation of linguistic patterns in very low-resource or extinct languages, and (ii) abilities to follow complex task instructions. The LingOly benchmark covers more than 90 mostly low-resource languages, minimising issues of data contamination, and contains 1,133 problems across 6 formats and 5 levels of human difficulty. We assess performance with both direct accuracy and comparison to a no-context baseline to penalise memorisation. Scores from 11 state-of-the-art LLMs demonstrate the benchmark to be challenging, and models perform poorly on the higher difficulty problems. On harder problems, even the top model only achieved 38.7% accuracy, a 24.7%…
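
The two scoring ideas named in the abstract (direct accuracy, and a comparison against a no-context baseline) can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption: answers are scored by exact string match after trimming whitespace, and the baseline comparison is taken to be the plain difference between the two accuracies. The function names `exact_match_accuracy` and `delta_over_no_context` are hypothetical and are not taken from the paper or its repository.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions equal to their reference answer.

    Assumption: exact string match after stripping surrounding whitespace.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def delta_over_no_context(
    with_context: Sequence[str],
    no_context: Sequence[str],
    references: Sequence[str],
) -> float:
    """Accuracy with the puzzle context minus accuracy without it.

    Assumption: a simple difference. Answers a model can produce without
    seeing the puzzle (e.g. from memorised data) are cancelled out.
    """
    return exact_match_accuracy(with_context, references) - exact_match_accuracy(
        no_context, references
    )


if __name__ == "__main__":
    # Toy, made-up answers purely for illustration.
    refs = ["a", "b", "c", "d"]
    full_prompt_answers = ["a", "b", "x", "d"]
    no_context_answers = ["a", "y", "z", "w"]
    print(exact_match_accuracy(full_prompt_answers, refs))
    print(delta_over_no_context(full_prompt_answers, no_context_answers, refs))
```

Under these assumptions, a model that answers equally well with and without the puzzle context scores a difference near zero, which is the sense in which such a comparison penalises memorisation.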

Code & Models

Repositories

am-bean/lingOly
Official

Datasets

ambean/lingOly
dataset · 6.1k downloads

Taxonomy

Topics: Natural Language Processing Techniques