ROI-Driven Foveated Attention for Unified Egocentric Representations in Vision-Language-Action Systems
Xinhai Sun, Xiang Shi, Menglin Zou, Wenlong Huang

TL;DR
This paper introduces an ROI-driven egocentric data representation for embodied AI. By focusing on contact-critical regions in vision-language-action systems, it aims to improve cross-embodiment transfer and reduce data collection costs.
Contribution
It proposes a geometry-grounded, hand-centric ROI pipeline that enhances data reuse and transferability across different robots in embodied AI systems.
Findings
ROI preserves high local information density in contact regions
Pipeline enables cross-embodiment data sharing and transfer
Improves scalability of vision-language-action models
Abstract
The development of embodied AI systems is increasingly constrained by the availability and structure of physical interaction data. Despite recent advances in vision-language-action (VLA) models, current pipelines suffer from high data collection cost, limited cross-embodiment alignment, and poor transfer from internet-scale visual data to robot control. We propose a region-of-interest (ROI) driven engineering workflow that introduces an egocentric, geometry-grounded data representation. By projecting end-effector poses via forward kinematics (FK) into a single external camera, we derive movement-aligned hand-centric ROIs without requiring wrist-mounted cameras or multi-view systems. Rather than downsampling the full frame directly, the ROI is cropped from the original image before resizing, preserving high local information density for contact-critical regions while retaining global context.…
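The projection-then-crop step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a standard pinhole camera model, assumes the FK solver has already produced the end-effector position in the world frame, and uses nearest-neighbour resizing in plain NumPy to stay dependency-free. All names (`project_point`, `crop_roi`, `resize_nearest`, `K`, `T_cam_world`) and all numeric values are hypothetical.

```python
import numpy as np

def project_point(p_world, K, T_cam_world):
    """Project a 3D world point to pixel coordinates (assumed pinhole model)."""
    p_h = np.append(p_world, 1.0)            # homogeneous coordinates
    p_cam = (T_cam_world @ p_h)[:3]          # world frame -> camera frame
    if p_cam[2] <= 0:
        raise ValueError("point is behind the camera")
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]                  # perspective divide

def crop_roi(image, center_uv, half_size):
    """Crop a square window around center_uv, shifted to stay inside the image."""
    h, w = image.shape[:2]
    u, v = np.round(center_uv).astype(int)
    x0 = int(np.clip(u - half_size, 0, max(w - 2 * half_size, 0)))
    y0 = int(np.clip(v - half_size, 0, max(h - 2 * half_size, 0)))
    return image[y0:y0 + 2 * half_size, x0:x0 + 2 * half_size]

def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize using integer index maps."""
    h, w = image.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return image[rows[:, None], cols]

# Hypothetical camera: 640x480 image, focal length 500 px, principal point at centre.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
T_cam_world = np.eye(4)                      # assumed: world frame == camera frame

# Hypothetical end-effector position, as an FK solver might return it (metres).
ee_world = np.array([0.1, -0.05, 1.0])

frame = np.zeros((480, 640, 3), dtype=np.uint8)
uv = project_point(ee_world, K, T_cam_world)

# Crop from the ORIGINAL frame first, then resize the crop.
roi = crop_roi(frame, uv, half_size=64)
roi_resized = resize_nearest(roi, 224, 224)

# Global-context stream: the full frame downsampled to the same size.
full_resized = resize_nearest(frame, 224, 224)
```

In this toy setup the 128x128 crop is resized from original-resolution pixels, whereas the full 640x480 frame is squeezed into the same 224x224 grid, so the hand region keeps far more source pixels per output pixel in the ROI stream. The crop size and output resolution here are illustrative choices, not values from the paper.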
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI
