ROI-Driven Foveated Attention for Unified Egocentric Representations in Vision-Language-Action Systems

Xinhai Sun; Xiang Shi; Menglin Zou; Wenlong Huang

arXiv:2603.20668 · cs.RO · March 24, 2026

TL;DR

This paper introduces an ROI-driven egocentric data representation for embodied AI. By focusing on contact-critical regions, it improves cross-embodiment transfer and reduces data collection costs in vision-language-action systems.

Contribution

The paper proposes a geometry-grounded, hand-centric ROI pipeline that enhances data reuse and transferability across different robots in embodied AI systems.

Findings

01. ROI preserves high local information density in contact regions

02. Pipeline enables cross-embodiment data sharing and transfer

03. Improves scalability of vision-language-action models

Abstract

The development of embodied AI systems is increasingly constrained by the availability and structure of physical interaction data. Despite recent advances in vision-language-action (VLA) models, current pipelines suffer from high data collection cost, limited cross-embodiment alignment, and poor transfer from internet-scale visual data to robot control. We propose a region-of-interest (ROI) driven engineering workflow that introduces an egocentric, geometry-grounded data representation. By projecting end-effector poses via forward kinematics (FK) into a single external camera, we derive movement-aligned hand-centric ROIs without requiring wrist-mounted cameras or multi-view systems. Rather than directly downsampling the full frame, the ROI is cropped from the original image before resizing, preserving high local information density for contact-critical regions while retaining global context.…
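
The crop-before-resize idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a pinhole camera with known intrinsics K and a world-to-camera transform T_cam_world, treats the FK output as a single 3D end-effector position, and uses a fixed square window with nearest-neighbor resizing. The function names, window size, and all numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

def project_point(p_world, K, T_cam_world):
    """Project a 3D world point to pixel coordinates (pinhole model, assumed)."""
    p_h = np.append(p_world, 1.0)            # homogeneous coordinates
    p_cam = (T_cam_world @ p_h)[:3]          # point in the camera frame
    if p_cam[2] <= 0:
        raise ValueError("point is behind the camera")
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]                  # (u, v) in pixels

def crop_roi(image, center_uv, size):
    """Crop a size x size window around center_uv, shifted to stay in bounds."""
    h, w = image.shape[:2]
    half = size // 2
    u, v = int(round(center_uv[0])), int(round(center_uv[1]))
    x0 = min(max(u - half, 0), max(w - size, 0))
    y0 = min(max(v - half, 0), max(h - size, 0))
    return image[y0:y0 + size, x0:x0 + size]

def resize_nearest(image, out_h, out_w):
    """Nearest-neighbor resize using integer index maps."""
    h, w = image.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return image[rows[:, None], cols]

# Hypothetical setup: 640x480 external camera, identity extrinsics.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
T_cam_world = np.eye(4)
ee_position = np.array([0.1, -0.05, 1.0])    # stand-in for an FK result

image = np.zeros((480, 640, 3), dtype=np.uint8)
uv = project_point(ee_position, K, T_cam_world)   # approximately (370, 215)

out = 64                                           # hypothetical model input size
roi = crop_roi(image, uv, 128)                     # crop first, at native resolution
roi_small = resize_nearest(roi, out, out)          # then resize the crop
full_small = resize_nearest(image, out, out)       # baseline: downsample full frame

# Source pixels per output pixel along the width: lower means denser detail.
roi_scale = roi.shape[1] / out
full_scale = image.shape[1] / out
```

With these illustrative numbers, each output pixel of the resized ROI covers 2 source pixels along the width, compared with 10 for the downsampled full frame, which is the sense in which cropping before resizing keeps more local detail around the hand. A pipeline that also retains global context, as the abstract describes, would keep the downsampled full frame alongside the ROI.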

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI