SEVO: Semantic-Enhanced Virtual Observation for Robust VLA Manipulation via Active Illumination and Data-Centric Collection

Tianchonghui Fang; Yuan Zhuang; Fei Miao

arXiv:2605.11114·cs.RO·May 13, 2026

TL;DR

SEVO makes vision-language-action policies for robot manipulation more robust across environments by combining body-fixed cameras, active red-spectrum illumination, a real-time YOLO segmentation overlay on the camera stream, and diversified data collection. It reports 85% success in novel environments.

Contribution

Introduces SEVO, a data-centric method that improves cross-environment manipulation robustness without changing policy architecture, emphasizing environmental diversity and observation design.

Findings

01. SEVO achieves 85% success in novel environments, outperforming baseline policies.

02. Diversified data collection is the key factor for generalization.

03. Principled observation design enables low-cost robots to operate reliably in household settings.

Abstract

Vision-Language-Action (VLA) and imitation-learning policies trained via community toolchains on low-cost hardware frequently fail when deployed outside the training environment. Existing evaluations, including the original ACT and SmolVLA benchmarks, demonstrate high success rates under controlled, fixed backgrounds, yet community practitioners report near-zero transfer to new environments. We present SEVO (Semantic-Enhanced Virtual Observation), a data-centric approach that improves cross-environment manipulation robustness without modifying the policy architecture. SEVO transforms the raw RGB camera stream through three mechanisms: (1) body-fixed cameras whose combined fields of view cover the full manipulation workspace, (2) active red-spectrum illumination that physically normalizes object appearance, and (3) real-time YOLO segmentation overlay that provides a background-invariant…
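
The third mechanism, a segmentation overlay on the camera stream, can be illustrated with a minimal sketch. The abstract is truncated before it describes the overlay in detail, so everything below is an assumption rather than the authors' implementation: the function name `overlay_mask`, the green tint, and the 50% blend are hypothetical, and the binary mask is built by hand where a real pipeline would take it from a segmentation model such as YOLO.

```python
import numpy as np


def overlay_mask(frame, mask, color=(0, 255, 0), alpha=0.5):
    """Blend a solid color into the masked pixels of an RGB frame.

    frame: (H, W, 3) uint8 RGB image.
    mask:  (H, W) boolean array, True where the object was segmented.
    color and alpha are illustrative defaults, not values from the paper.
    """
    if frame.shape[:2] != mask.shape:
        raise ValueError("mask must match the frame's height and width")
    out = frame.astype(np.float32)            # copy; the input frame is untouched
    tint = np.asarray(color, dtype=np.float32)
    region = mask.astype(bool)
    # Alpha-blend only the segmented pixels; background pixels stay as captured.
    out[region] = (1.0 - alpha) * out[region] + alpha * tint
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


# Hand-made stand-in for a camera frame and a segmentation mask.
frame = np.full((4, 4, 3), 100, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

virtual = overlay_mask(frame, mask)
```

In this sketch the object region carries the same tint whatever the background looks like, which is one plausible reading of how an overlay could make the policy's input less dependent on the scene. The policy would consume `virtual` in place of the raw frame.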
