RealBirdID: Benchmarking Bird Species Identification in the Era of MLLMs

Logan Lawrence; Mustafa Chasmai; Rangel Daroya; Wuao Liu; Seoyun Jeong; Aaron Sun; Max Hamilton; Fabien Delattre; Oindrila Saha; Subhransu Maji; Grant Van Horn

arXiv:2603.27033·cs.CV·March 31, 2026

TL;DR

RealBirdID introduces a benchmark for bird species identification that emphasizes the importance of abstaining with evidence-based rationales when images are unanswerable, revealing current model limitations.

Contribution

The paper presents the RealBirdID benchmark, focusing on abstention and rationale generation in fine-grained bird identification, highlighting challenges for existing models.

Findings

01

Species identification accuracy is below 13% on the answerable set for current models.

02

Models with higher classification ability do not necessarily abstain more appropriately.

03

MLLMs often fail to provide correct reasons even when they abstain.

Abstract

Fine-grained bird species identification in the wild is frequently unanswerable from a single image: key cues may be non-visual (e.g. vocalization), or obscured due to occlusion, camera angle, or low resolution. Yet today's multimodal systems are typically judged on answerable, in-schema cases, encouraging confident guesses rather than principled abstention. We propose the RealBirdID benchmark: given an image of a bird, a system should either answer with a species or abstain with a concrete, evidence-based rationale: "requires vocalization," "low quality image," or "view obstructed". For each genus, the dataset includes a validation split composed of curated unanswerable examples with labeled rationales, paired with a companion set of clearly answerable instances. We find that (1) the species identification on the answerable set is challenging for a variety of open-source and…
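
The answer-or-abstain protocol described above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the record layout (`Example`, `Prediction`), the `score` function, and the three metric names are all hypothetical, chosen only to mirror the three findings (accuracy on answerable images, abstention on unanswerable ones, and rationale correctness among abstentions). Exact string matching on species and rationale is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record layout: exactly one of the two gold fields is set.
@dataclass
class Example:
    gold_species: Optional[str] = None    # set for answerable images
    gold_rationale: Optional[str] = None  # set for unanswerable images, e.g.
                                          # "requires vocalization",
                                          # "low quality image", "view obstructed"

# Hypothetical system output: a species (answer) or a rationale (abstain).
@dataclass
class Prediction:
    species: Optional[str] = None
    rationale: Optional[str] = None


def score(examples, predictions):
    """Return three illustrative metrics (names are assumptions)."""
    answerable = correct = 0
    unanswerable = abstained = right_reason = 0
    for ex, pred in zip(examples, predictions):
        if ex.gold_species is not None:
            # Answerable image: an abstention counts as a wrong answer.
            answerable += 1
            correct += pred.species == ex.gold_species
        else:
            # Unanswerable image: any species guess counts as a failure to abstain.
            unanswerable += 1
            if pred.species is None:
                abstained += 1
                right_reason += pred.rationale == ex.gold_rationale
    return {
        "species_accuracy": correct / answerable if answerable else 0.0,
        "abstention_rate": abstained / unanswerable if unanswerable else 0.0,
        # Measured only over the cases where the system actually abstained.
        "rationale_accuracy": right_reason / abstained if abstained else 0.0,
    }
```

Keeping the three numbers separate reflects the point made in the findings: a system can score well on one (say, classification) without doing well on the others (abstaining, or abstaining for the right reason).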
