Information-Theoretic Graph Fusion with Vision-Language-Action Model for Policy Reasoning and Dual Robotic Control
Shunlei Li, Longsen Gao, Jin Wang, Chang Che, Xi Xiao, Jiuwen Cao, Yingbai Hu, Hamid Reza Karimi

TL;DR
This paper introduces GF-VLA, a framework that combines information-theoretic scene graphs with a language-conditioned transformer so that dual-arm robots can learn complex tasks from human video demonstrations, improving generalization and interpretability.
Contribution
The paper presents a new graph-fused vision-language-action model that enhances robotic policy reasoning and control from human demonstrations, with a focus on dual-arm manipulation and task generalization.
Findings
Achieves over 95% graph accuracy and 93% subtask segmentation accuracy.
Yields 94% grasp success and 89% placement accuracy in dual-arm robot experiments.
Demonstrates strong generalization across diverse tasks and spatial configurations.
Abstract
Teaching robots dexterous skills from human videos remains challenging due to the reliance on low-level trajectory imitation, which fails to generalize across object types, spatial layouts, and manipulator configurations. We propose Graph-Fused Vision-Language-Action (GF-VLA), a framework that enables dual-arm robotic systems to perform task-level reasoning and execution directly from RGB and depth human demonstrations. GF-VLA first extracts Shannon-information-based cues to identify the hands and objects with the highest task relevance, then encodes these cues into temporally ordered scene graphs that capture both hand-object and object-object interactions. These graphs are fused with a language-conditioned transformer that generates hierarchical behavior trees and interpretable Cartesian motion commands. To improve execution efficiency in bimanual settings, we further introduce a…
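The abstract excerpt does not say how the Shannon-information cues are computed, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes each tracked hand or object is summarized by a list of per-frame displacements in millimetres (a hypothetical input format), discretizes them into bins, and ranks entities by the Shannon entropy of the resulting symbols. The names `shannon_entropy`, `discretize`, `rank_by_entropy`, and the `bin_width_mm` parameter are all illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Return the Shannon entropy, in bits, of a sequence of discrete symbols."""
    total = len(symbols)
    if total == 0:
        return 0.0
    counts = Counter(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def discretize(values_mm, bin_width_mm):
    """Map each displacement (in mm) to an integer bin index."""
    return [int(v // bin_width_mm) for v in values_mm]


def rank_by_entropy(tracks, bin_width_mm=10):
    """Rank entities from most to least informative.

    `tracks` maps an entity name to its per-frame displacements in mm.
    This input format is an assumption made for illustration only.
    """
    scores = {
        name: shannon_entropy(discretize(values, bin_width_mm))
        for name, values in tracks.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical demonstration: the cup moves, the table does not.
    demo_tracks = {
        "cup": [0, 25, 50, 10],
        "table": [0, 0, 0, 0],
    }
    for name, score in rank_by_entropy(demo_tracks):
        print(f"{name}: {score:.2f} bits")
```

Under this assumption, an entity whose motion varies across the demonstration scores higher than a static one, which is one simple way to pick out task-relevant hands and objects before building a scene graph.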
