OTAS: Open-vocabulary Token Alignment for Outdoor Segmentation

Simon Schwaiger; Stefan Thalhammer; Wilfried Wöber; Gerald Steinbauer-Wagner

arXiv:2507.08851 · cs.RO · September 23, 2025

TL;DR

OTAS introduces a zero-shot, open-vocabulary outdoor segmentation method that extracts semantic structures directly from pre-trained vision models, enabling real-time, geometrically consistent 3D segmentation without scene-specific fine-tuning.

Contribution

The paper presents OTAS, a novel approach that leverages token alignment from pre-trained models for outdoor segmentation, overcoming limitations of object-centric priors in unstructured environments.

Findings

01

Achieves real-time performance (~17 fps) in outdoor segmentation.

02

Improves IoU by up to 151% over existing methods on the TartanAir dataset.

03

Demonstrates applicability to real-world robotic deployment.

Abstract

Understanding open-world semantics is critical for robotic planning and control, particularly in unstructured outdoor environments. Existing vision-language mapping approaches typically rely on object-centric segmentation priors, which often fail outdoors due to semantic ambiguities and indistinct class boundaries. We propose OTAS - an Open-vocabulary Token Alignment method for outdoor Segmentation. OTAS addresses the limitations of open-vocabulary segmentation models by extracting semantic structure directly from the output tokens of pre-trained vision models. By clustering semantically similar structures across single and multiple views and grounding them in language, OTAS reconstructs a geometrically consistent feature field that supports open-vocabulary segmentation queries. Our method operates in a zero-shot manner, without scene-specific fine-tuning, and achieves real-time…
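The clustering-and-grounding idea described in the abstract can be illustrated with a minimal single-view sketch. This is not the authors' implementation: spherical k-means stands in for whatever clustering OTAS actually uses, the patch tokens and the text embedding are assumed to already live in a shared embedding space (as with a CLIP-style model), and every name here (`kmeans_cosine`, `segment_by_query`, `k`, `threshold`) is hypothetical. Multi-view alignment and the 3D feature field are not covered.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def kmeans_cosine(tokens, k, iters=10):
    """Spherical k-means over token embeddings (an assumed stand-in clustering).

    Uses deterministic farthest-point initialisation: start from token 0,
    then repeatedly add the token least similar to the centers chosen so far.
    """
    x = l2_normalize(tokens)
    centers = [x[0]]
    for _ in range(1, k):
        best_sim = np.max(x @ np.stack(centers).T, axis=1)
        centers.append(x[np.argmin(best_sim)])
    centers = np.stack(centers)

    for _ in range(iters):
        labels = np.argmax(x @ centers.T, axis=1)  # assign to nearest center
        for c in range(k):
            members = x[labels == c]
            if len(members):
                centers[c] = l2_normalize(members.mean(axis=0))
    labels = np.argmax(x @ centers.T, axis=1)
    return labels, centers

def segment_by_query(tokens, text_embedding, k=4, threshold=0.5):
    """Return a boolean mask over tokens whose cluster matches the text query."""
    labels, centers = kmeans_cosine(tokens, k)
    # Ground each cluster in language: cosine similarity of centroid and query.
    cluster_sim = centers @ l2_normalize(text_embedding)
    return (cluster_sim > threshold)[labels]

# Toy example: 16 "patch tokens" in an 8-D space forming two semantic groups.
rng = np.random.default_rng(0)
group_a = np.eye(8)[0] + 0.01 * rng.standard_normal((8, 8))
group_b = np.eye(8)[1] + 0.01 * rng.standard_normal((8, 8))
tokens = np.vstack([group_a, group_b])
query = np.eye(8)[0]  # hypothetical text embedding aligned with group A
mask = segment_by_query(tokens, query, k=2)
```

Scoring cluster centroids rather than individual tokens means the query is matched against aggregated structure, so a single noisy token does not flip its own label. In a real pipeline the flat mask would be reshaped to the patch grid and upsampled to image resolution.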

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques