Choose What to Observe: Task-Aware Semantic-Geometric Representations for Visuomotor Policy

Haoran Ding; Liang Ma; Yaxun Yang; Wen Yang; Tianyu Liu; Anqing Duan; Xiaodan Liang; Dezhen Song; Ivan Laptev; Yoshihiko Nakamura

arXiv:2603.07875 · cs.RO · March 10, 2026

Open Access

TL;DR

This paper introduces a task-aware observation interface that canonicalizes RGB images into semantic-geometric representations, making visuomotor policies more robust to appearance changes such as background changes and object recoloring, without modifying or fine-tuning the policy.

Contribution

The paper proposes an observation interface that combines open-vocabulary segmentation (SAM3) with monocular depth (Depth Anything 3) to improve policy robustness to visual domain shifts.

Findings

01 Improves robustness to background changes and object recoloring.

02 Maintains in-distribution performance across various benchmarks.

03 Remains effective with different policy architectures.

Abstract

Visuomotor policies learned from demonstrations often overfit to nuisance visual factors in raw RGB observations, resulting in brittle behavior under appearance shifts such as background changes and object recoloring. We propose a task-aware observation interface that canonicalizes visual input into a shared representation, improving robustness to out-of-distribution (OOD) appearance changes without modifying or fine-tuning the policy. Given an RGB image and an open-vocabulary specification of task-relevant entities, we use SAM3 to segment the target object and robot/gripper. We construct an L0 observation by repainting segmented entities with predefined semantic colors on a constant background. For tasks requiring stronger geometric cues, we further inject monocular depth from Depth Anything 3 into the segmented regions via depth-guided overwrite, yielding a unified semantic-geometric…
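The L0 construction described in the abstract can be illustrated with a minimal NumPy sketch. It assumes the segmentation masks are already available as boolean arrays (no SAM3 or Depth Anything 3 call is made here). The function names, the color palette, and the black background are hypothetical. The depth step in particular is only one possible reading of "depth-guided overwrite": it scales each entity's semantic color by normalized depth inside the masks. The paper's actual rule is not given in the text above.

```python
import numpy as np

# Hypothetical palette and background; the paper's actual values are not stated here.
SEMANTIC_COLORS = {"target_object": (255, 0, 0), "robot_gripper": (0, 0, 255)}
BACKGROUND_COLOR = (0, 0, 0)


def build_l0_observation(masks, shape, colors=SEMANTIC_COLORS,
                         background=BACKGROUND_COLOR):
    """Repaint segmented entities with fixed semantic colors on a constant background.

    masks: dict mapping entity name -> boolean (H, W) array from a segmenter.
    shape: (H, W) of the output image.
    """
    h, w = shape
    obs = np.empty((h, w, 3), dtype=np.uint8)
    obs[:] = background
    for name, mask in masks.items():
        obs[mask] = colors[name]  # later entities overwrite earlier ones on overlap
    return obs


def overwrite_with_depth(obs, masks, depth):
    """ASSUMED form of depth injection: scale colors inside the masks by depth.

    Depth is min-max normalized over the union of all masks and mapped to a
    brightness factor in [0.5, 1.0], so masked pixels never turn black.
    """
    union = np.zeros(depth.shape, dtype=bool)
    for mask in masks.values():
        union |= mask
    if not union.any():
        return obs.copy()
    lo, hi = depth[union].min(), depth[union].max()
    norm = (depth - lo) / (hi - lo) if hi > lo else np.zeros(depth.shape)
    scale = 0.5 + 0.5 * norm
    out = obs.astype(np.float32)
    out[union] = out[union] * scale[union][:, None]
    return np.clip(out.round(), 0, 255).astype(np.uint8)


# Toy 4x4 example with hand-made masks and depth values.
obj = np.zeros((4, 4), dtype=bool)
obj[1, 1] = obj[1, 2] = True
grip = np.zeros((4, 4), dtype=bool)
grip[3, 0] = True
masks = {"target_object": obj, "robot_gripper": grip}

depth = np.full((4, 4), 5.0)
depth[1, 1], depth[1, 2], depth[3, 0] = 1.0, 3.0, 2.0

l0 = build_l0_observation(masks, (4, 4))
l0_depth = overwrite_with_depth(l0, masks, depth)
```

Because the policy only ever sees the repainted image, a recolored object or a different background produces the same observation as long as the masks are unchanged, which is the intuition behind the robustness claim.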

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics