SEVO: Semantic-Enhanced Virtual Observation for Robust VLA Manipulation via Active Illumination and Data-Centric Collection
Tianchonghui Fang, Yuan Zhuang, Fei Miao

TL;DR
SEVO makes vision-language-action policies for robot manipulation more robust across environments by transforming the camera input with active illumination and semantic segmentation and by diversifying data collection, which improves transfer success in environments not seen during training.
Contribution
Introduces SEVO, a data-centric method that improves cross-environment manipulation robustness without changing policy architecture, emphasizing environmental diversity and observation design.
Findings
SEVO achieves 85% success in novel environments, outperforming baseline policies.
Diversified data collection is the key factor for generalization.
Principled observation design enables low-cost robots to operate reliably in household settings.
Abstract
Vision-Language-Action (VLA) and imitation-learning policies trained via community toolchains on low-cost hardware frequently fail when deployed outside the training environment. Existing evaluations, including the original ACT and SmolVLA benchmarks, demonstrate high success rates under controlled, fixed backgrounds, yet community practitioners report near-zero transfer to new environments. We present SEVO (Semantic-Enhanced Virtual Observation), a data-centric approach that improves cross-environment manipulation robustness without modifying the policy architecture. SEVO transforms the raw RGB camera stream through three mechanisms: (1) body-fixed cameras whose combined fields of view cover the full manipulation workspace, (2) active red-spectrum illumination that physically normalizes object appearance, and (3) real-time YOLO segmentation overlay that provides a background-invariant…
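To make mechanism (3) concrete, here is a minimal sketch of a segmentation overlay, assuming the overlay is a simple alpha blend of a fixed color into the masked pixels. The abstract does not say how SEVO renders its overlay, so the blend, the color, the `alpha` value, and the function name `overlay_mask` are all illustrative assumptions. The mask is taken as given; no segmentation model is called here.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each channel 0..255
Frame = List[List[Pixel]]      # rows of pixels
Mask = List[List[bool]]        # True where the segmenter marked the object


def overlay_mask(frame: Frame, mask: Mask,
                 color: Pixel = (0, 255, 0), alpha: float = 0.5) -> Frame:
    """Alpha-blend `color` into every pixel where `mask` is True.

    Hypothetical helper: the color and blend weight are assumptions,
    not values taken from the paper.
    """
    same_rows = len(frame) == len(mask)
    if not same_rows or any(len(f) != len(m) for f, m in zip(frame, mask)):
        raise ValueError("frame and mask must have the same shape")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")

    out: Frame = []
    for frame_row, mask_row in zip(frame, mask):
        row: List[Pixel] = []
        for pixel, is_object in zip(frame_row, mask_row):
            if is_object:
                # Blend each channel toward the overlay color.
                blended = tuple(
                    round((1 - alpha) * p + alpha * c)
                    for p, c in zip(pixel, color)
                )
                row.append(blended)
            else:
                # Background pixels pass through unchanged.
                row.append(pixel)
        out.append(row)
    return out
```

With `alpha=1.0` the masked region becomes a flat color regardless of the underlying texture, which is one way an overlay could stay the same when the background changes; smaller values keep some of the original appearance visible.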
