BAGEL: Benchmarking Animal Knowledge Expertise in Language Models

Jiacheng Shen; Masato Hagiwara; Milad Alizadeh; Ellen Gilsenan-McMahon; Marius Miron; David Robinson; Emmanuel Chemla; Sara Keen; Gagan Narula; Mathieu Laurière; Matthieu Geist; Olivier Pietquin

arXiv:2604.16241·cs.CL·April 20, 2026

TL;DR

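BAGEL is a benchmark that evaluates how well large language models answer specialized animal-related questions, spanning taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions, in a closed-book setting without external retrieval.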

Contribution

The paper introduces a domain-specific benchmark built from bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia, and uses it to analyze language models' animal knowledge and their systematic strengths and weaknesses.

Findings

01

BAGEL enables detailed analysis of model performance across animal knowledge categories.

02

The benchmark reveals systematic failure modes in models' understanding of animal-related information.

03

BAGEL facilitates research on domain-specific knowledge generalization in language models.

Abstract

Large language models have shown strong performance on broad-domain knowledge and reasoning benchmarks, but it remains unclear how well they handle specialized animal-related knowledge under a unified closed-book evaluation protocol. We introduce BAGEL, a benchmark for evaluating animal knowledge expertise in language models. BAGEL is constructed from diverse scientific and reference sources, including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia, using a combination of curated examples and automatically generated closed-book question-answer pairs. The benchmark covers multiple aspects of animal knowledge, including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions. By focusing on closed-book evaluation, BAGEL measures models' animal-related knowledge without external retrieval at inference time. BAGEL…
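To make the closed-book, per-category evaluation described in the abstract concrete, the following is a minimal sketch in Python. Everything in it is an assumption for illustration: the item fields (`category`, `question`, `answer`), the `answer_fn` callable standing in for a model, the exact-match scoring rule, and the toy items are all hypothetical and are not taken from BAGEL or its official scoring procedure.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def per_category_accuracy(
    items: Iterable[Dict[str, str]],
    answer_fn: Callable[[str], str],
) -> Dict[str, float]:
    """Score a model per knowledge category using exact match (an assumed metric)."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for item in items:
        # Closed-book: the model receives only the question, with no retrieved context.
        prediction = answer_fn(item["question"])
        total[item["category"]] += 1
        if normalize(prediction) == normalize(item["answer"]):
            correct[item["category"]] += 1
    return {category: correct[category] / total[category] for category in total}


# Hypothetical toy items, made up for this sketch (not drawn from the benchmark).
toy_items = [
    {"category": "taxonomy", "question": "Q1", "answer": "Canidae"},
    {"category": "taxonomy", "question": "Q2", "answer": "Felidae"},
    {"category": "habitat", "question": "Q3", "answer": "tundra"},
]

# Stub "model": a fixed lookup table standing in for a real language model call.
stub_answers = {"Q1": " canidae ", "Q2": "Canidae", "Q3": "Tundra"}

scores = per_category_accuracy(toy_items, lambda question: stub_answers[question])
print(scores)  # {'taxonomy': 0.5, 'habitat': 1.0}
```

Reporting one score per category, rather than a single aggregate, is what lets an analysis of this kind surface category-specific weaknesses. A real evaluation would likely need a more forgiving answer-matching rule than exact match.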
