AetherVision-Bench: An Open-Vocabulary RGB-Infrared Benchmark for Multi-Angle Segmentation across Aerial and Ground Perspectives

Aniruddh Sikdar; Aditya Gandhamal; Suresh Sundaram

arXiv:2506.03709 · cs.CV · June 5, 2025

TL;DR

This paper introduces AetherVision-Bench, a comprehensive benchmark for evaluating open-vocabulary multi-angle segmentation across aerial and ground views, addressing cross-domain generalization challenges in embodied AI systems.

Contribution

It presents a new benchmark for multi-angle segmentation in aerial and ground perspectives, enabling extensive evaluation of open-vocabulary models and their zero-shot transfer capabilities.

Findings

01. State-of-the-art OVSS models show limited cross-domain generalization.

02. The benchmark reveals key factors affecting model performance across views.

03. Insights from evaluations guide future improvements in embodied AI perception.

Abstract

Open-vocabulary semantic segmentation (OVSS) involves assigning labels to each pixel in an image based on textual descriptions, leveraging world models like CLIP. However, OVSS models encounter significant challenges in cross-domain generalization, hindering their practical efficacy in real-world applications. Embodied AI systems are transforming autonomous navigation for ground vehicles and drones by enhancing their perception abilities. In this study, we present AetherVision-Bench, a benchmark for multi-angle segmentation across aerial and ground perspectives, which facilitates an extensive evaluation of performance across different viewing angles and sensor modalities. We assess state-of-the-art OVSS models on the proposed benchmark and investigate the key factors that impact the performance of zero-shot transfer models. Our work pioneers the creation of a robustness benchmark,…
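As an illustration of the per-pixel labeling idea described in the abstract, the following minimal sketch assigns each pixel the class whose text embedding is most similar to that pixel's embedding. It is not the authors' method or any specific OVSS model: the function name `assign_labels`, the class names, and the toy embeddings are all hypothetical, and in a real system the embeddings would come from a vision-language model such as CLIP rather than being written by hand.

```python
import numpy as np

def assign_labels(pixel_emb, text_emb, eps=1e-8):
    """Label each pixel with the index of its most similar text embedding.

    pixel_emb: array of shape (H, W, D), one embedding per pixel (assumed given).
    text_emb:  array of shape (C, D), one embedding per class description.
    Returns an integer array of shape (H, W) with values in [0, C).
    """
    # L2-normalize so that the dot product equals cosine similarity.
    p = pixel_emb / (np.linalg.norm(pixel_emb, axis=-1, keepdims=True) + eps)
    t = text_emb / (np.linalg.norm(text_emb, axis=-1, keepdims=True) + eps)
    similarity = p @ t.T  # shape (H, W, C)
    return similarity.argmax(axis=-1)

# Hypothetical class vocabulary; any free-form text could be used instead.
class_names = ["road", "building", "vegetation"]

# Toy stand-ins for text embeddings (one row per class), not real CLIP outputs.
text_emb = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])

# Toy stand-ins for per-pixel embeddings of a 2x2 image.
pixel_emb = np.array([
    [[0.9, 0.1, 0.0, 0.0], [0.1, 0.8, 0.1, 0.0]],
    [[0.0, 0.2, 0.7, 0.1], [0.5, 0.4, 0.1, 0.0]],
])

labels = assign_labels(pixel_emb, text_emb)
named = [[class_names[i] for i in row] for row in labels]
print(named)
```

Because the vocabulary is just a list of text descriptions, swapping in new class names only changes `text_emb`; no segmentation head has to be retrained, which is what makes the setting "open-vocabulary" and zero-shot.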

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning