OTAS: Open-vocabulary Token Alignment for Outdoor Segmentation
Simon Schwaiger, Stefan Thalhammer, Wilfried Wöber, Gerald Steinbauer-Wagner

TL;DR
OTAS introduces a zero-shot, open-vocabulary outdoor segmentation method that extracts semantic structures directly from pre-trained vision models, enabling real-time, geometrically consistent 3D segmentation without scene-specific fine-tuning.
Contribution
The paper presents OTAS, a novel approach that leverages token alignment from pre-trained models for outdoor segmentation, overcoming limitations of object-centric priors in unstructured environments.
Findings
Achieves real-time performance (~17 fps) in outdoor segmentation.
Improves IoU by up to 151% over existing methods on the TartanAir dataset.
Demonstrates applicability to real-world robotic deployment.
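As a rough illustration of how an IoU figure such as "up to 151%" can be read, the sketch below computes intersection-over-union for two binary masks and then a relative gain between two IoU scores. It assumes the reported percentage is a relative improvement (an absolute IoU gain cannot exceed 100 points). The masks and the scores 0.20 and 0.502 are made-up values for illustration only, not results from the paper.

```python
def iou(pred, target):
    """Intersection over union of two binary masks given as flat 0/1 lists."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Two empty masks agree perfectly by convention.
    return inter / union if union else 1.0


def relative_improvement(baseline, new):
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Hypothetical 2x3 masks, flattened row by row.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
score = iou(pred, target)  # 2 overlapping pixels out of 4 in the union

# Hypothetical IoU scores for a baseline and an improved method.
gain = relative_improvement(0.20, 0.502)
print(score, round(gain, 1))
```

Under this reading, a baseline IoU of 0.20 rising to roughly 0.50 corresponds to a gain of about 151%.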
Abstract
Understanding open-world semantics is critical for robotic planning and control, particularly in unstructured outdoor environments. Existing vision-language mapping approaches typically rely on object-centric segmentation priors, which often fail outdoors due to semantic ambiguities and indistinct class boundaries. We propose OTAS - an Open-vocabulary Token Alignment method for outdoor Segmentation. OTAS addresses the limitations of open-vocabulary segmentation models by extracting semantic structure directly from the output tokens of pre-trained vision models. By clustering semantically similar structures across single and multiple views and grounding them in language, OTAS reconstructs a geometrically consistent feature field that supports open-vocabulary segmentation queries. Our method operates in a zero-shot manner, without scene-specific fine-tuning, and achieves real-time…
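The idea of clustering output tokens and grounding the clusters in language can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the 2-D "tokens" and the text embedding are invented, the function names are hypothetical, and a simple cosine-based k-means stands in for whatever clustering OTAS actually uses. Each patch token is assigned to a cluster, each cluster centroid is compared with a text-query embedding, and the tokens of sufficiently similar clusters form the segmentation mask.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cluster_tokens(tokens, k, iters=10):
    """Cosine-based k-means; centroids start at the first k tokens (assumption)."""
    centroids = [list(t) for t in tokens[:k]]
    labels = [0] * len(tokens)
    for _ in range(iters):
        # Assign each token to its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(t, centroids[c]))
                  for t in tokens]
        # Recompute each centroid from its members, keeping empty ones as-is.
        for c in range(k):
            members = [t for t, lab in zip(tokens, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids


def segment(tokens, text_embedding, k=2, threshold=0.5):
    """Mark tokens whose cluster centroid matches the text query."""
    labels, centroids = cluster_tokens(tokens, k)
    scores = [cosine(c, text_embedding) for c in centroids]
    return [1 if scores[lab] >= threshold else 0 for lab in labels]


# Hypothetical patch tokens (one per image patch) and a hypothetical
# text embedding for a query such as "grass".
tokens = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]
query = [0.0, 1.0]
mask = segment(tokens, query)
print(mask)
```

In a real pipeline the tokens would come from a pre-trained vision encoder and the query from a matching text encoder. Scoring whole clusters rather than individual tokens is what keeps the resulting mask coherent across semantically similar regions.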
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
