TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation
Hanyu Zhou, Chuanhao Ma, Gim Hee Lee

TL;DR
TriRelVLA introduces a triadic relational framework for embodied manipulation, enhancing generalization across scenes, objects, and tasks by focusing on object-hand-task relations rather than appearance.
Contribution
The paper proposes a triadic relational structure and a graph-based approach to improve the transferability of vision-language-action (VLA) models for robotic manipulation.
Findings
Strong performance on fine-tuned tasks.
Significant gains in cross-scene generalization.
Effective relation-conditioned action generation.
Abstract
Vision-language-action (VLA) models perform well on robotic tasks seen during training but struggle to generalize to unseen scenes and objects. A key limitation lies in their implicit visual representations, which entangle object appearance, background, and scene layout, making policies sensitive to visual variations. Prior work improves transferability through structured intermediate representations that objectify visual content. However, these representations mainly capture scene semantics rather than action-relevant relations, so action prediction remains tied to appearance statistics. We observe that manipulation actions depend on the object-hand-task relational structure, which governs interactions among task requirements, robot states, and object properties. Based on this observation, we propose TriRelVLA, a triadic relational VLA framework for generalizable embodied…
