Zero-Shot and Supervised Bird Image Segmentation Using Foundation Models: A Dual-Pipeline Approach with Grounding DINO 1.5, YOLOv11, and SAM 2.1
Abhinav Munagala

TL;DR
This paper introduces a dual-pipeline framework for binary bird image segmentation built on SAM 2.1, Grounding DINO 1.5, and YOLOv11. On CUB-200-2011 it reaches IoU 0.912 in the supervised setting and IoU 0.831 in the zero-shot setting, and the segmentation model is never retrained for new species.
Contribution
The paper presents a dual-pipeline approach to bird segmentation in which a frozen SAM 2.1 is prompted with bounding boxes from either a zero-shot detector (Grounding DINO 1.5) or a fine-tuned detector (YOLOv11), so the segmentation model needs no retraining for new species or domains.
Findings
Supervised pipeline achieves IoU 0.912, Dice 0.954, F1 0.953 on CUB-200-2011.
Zero-shot pipeline achieves IoU 0.831 on the same benchmark.
Prompt-based foundation model pipelines outperform task-specific end-to-end trained models.
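The IoU and Dice figures above are overlap metrics between a predicted binary mask and a ground-truth mask. The following is a minimal sketch of how such metrics are commonly computed, not the paper's evaluation code; the function names and the nested-list mask representation are assumptions chosen for illustration. For a single mask pair the two metrics are linked by Dice = 2·IoU / (1 + IoU), and the reported 0.912 and 0.954 are consistent with that identity, although dataset averages need not satisfy it exactly.

```python
from typing import Sequence, Tuple

# Assumed representation: a binary mask is a sequence of rows of 0/1 values.
Mask = Sequence[Sequence[int]]


def confusion_counts(pred: Mask, target: Mask) -> Tuple[int, int, int]:
    """Count true positives, false positives, and false negatives over all pixels."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    return tp, fp, fn


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union; two empty masks count as a perfect match."""
    tp, fp, fn = confusion_counts(pred, target)
    union = tp + fp + fn
    return tp / union if union else 1.0


def dice(pred: Mask, target: Mask) -> float:
    """Dice coefficient: twice the overlap divided by the sum of both mask areas."""
    tp, fp, fn = confusion_counts(pred, target)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def dice_from_iou(iou_value: float) -> float:
    """Per-mask identity linking the two metrics."""
    return 2 * iou_value / (1 + iou_value)


# Toy 2x3 example (made-up masks, not data from the paper).
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]

print(iou(pred, target))
print(dice(pred, target))
print(round(dice_from_iou(0.912), 3))
```

In the toy example the masks share two pixels, with one false positive and one false negative, so the overlap is two pixels out of a union of four.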
Abstract
Bird image segmentation remains a challenging task in computer vision due to extreme pose diversity, complex plumage patterns, and variable lighting conditions. This paper presents a dual-pipeline framework for binary bird image segmentation leveraging 2025 foundation models. We introduce two operating modes built upon Segment Anything Model 2.1 (SAM 2.1) as a shared frozen backbone: (1) a zero-shot pipeline that uses Grounding DINO 1.5 to detect birds via the text prompt "bird" and then prompts SAM 2.1 with the resulting bounding boxes, requiring no labelled bird data; and (2) a supervised pipeline that fine-tunes YOLOv11 on the CUB-200-2011 dataset for high-precision detection, again prompting SAM 2.1 for pixel-level masks. The segmentation model is never retrained for new species or domains. On CUB-200-2011 (11,788 images, 200 species), the supervised pipeline achieves IoU 0.912, Dice 0.954, and F1…
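The two operating modes described in the abstract share one structure: a detector proposes bounding boxes, and a frozen segmenter turns each box into a mask. The sketch below illustrates only that structure. It does not call Grounding DINO 1.5, YOLOv11, or SAM 2.1; the `Detector` and `Segmenter` interfaces, the stub functions, and the nested-list image type are all hypothetical stand-ins, and merging per-box masks by union is an assumption about how a single binary mask might be formed.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), max exclusive
Image = List[List[int]]          # placeholder image type: rows of pixel values
Mask = List[List[int]]           # binary mask with the image's height and width

# Hypothetical interfaces: any detector or box-promptable segmenter could fit.
Detector = Callable[[Image], List[Box]]
Segmenter = Callable[[Image, Box], Mask]


def segment_birds(image: Image, detect: Detector, segment: Segmenter) -> Mask:
    """Run detection, prompt the segmenter once per box, and union the masks."""
    height, width = len(image), len(image[0])
    combined = [[0] * width for _ in range(height)]
    for box in detect(image):
        mask = segment(image, box)
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    combined[y][x] = 1
    return combined


def stub_text_detector(image: Image) -> List[Box]:
    """Stand-in for a text-prompted zero-shot detector (returns a fixed box)."""
    return [(1, 0, 3, 2)]


def stub_finetuned_detector(image: Image) -> List[Box]:
    """Stand-in for a detector fine-tuned on labelled data (returns a fixed box)."""
    return [(1, 1, 3, 2)]


def stub_box_segmenter(image: Image, box: Box) -> Mask:
    """Stand-in for a frozen box-prompted segmenter: fills the box interior."""
    x0, y0, x1, y1 = box
    return [
        [1 if x0 <= x < x1 and y0 <= y < y1 else 0 for x in range(len(image[0]))]
        for y in range(len(image))
    ]


image = [[0] * 4 for _ in range(3)]  # toy 3x4 image

# Only the detector changes between modes; the segmenter is shared.
zero_shot_mask = segment_birds(image, stub_text_detector, stub_box_segmenter)
supervised_mask = segment_birds(image, stub_finetuned_detector, stub_box_segmenter)
```

Passing the detector as an argument mirrors the design point in the abstract: swapping the detection stage leaves the segmentation stage untouched.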
Taxonomy
Topics: Animal Vocal Communication and Behavior · Advanced Neural Network Applications · Species Distribution and Climate Change
