TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation

Hanyu Zhou; Chuanhao Ma; Gim Hee Lee

arXiv:2605.05714 · cs.CV · May 8, 2026

TL;DR

TriRelVLA introduces a triadic relational framework for embodied manipulation, enhancing generalization across scenes, objects, and tasks by focusing on object-hand-task relations rather than appearance.

Contribution

The paper proposes a triadic relational structure and a graph-based approach to improve the transferability of vision-language-action (VLA) robotic models.

Findings

01. Strong performance on fine-tuned tasks.

02. Significant gains in cross-scene generalization.

03. Effective relation-conditioned action generation.

Abstract

Vision-language-action (VLA) models perform well on training-seen robotic tasks but struggle to generalize to unseen scenes and objects. A key limitation lies in their implicit visual representations, which entangle object appearance, background, and scene layout. This makes policies sensitive to visual variations. Prior work improves transferability through structured intermediate representations that objectify visual content. However, these representations mainly capture scene semantics instead of action-relevant relations. As a result, action prediction remains tied to appearance statistics. We observe that manipulation actions depend on the object-hand-task relational structure, which governs interactions among task requirements, robot states, and object properties. Based on this observation, we propose TriRelVLA, a triadic relational VLA framework for generalizable embodied…
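
The abstract is cut off above, so the actual TriRelVLA architecture is not visible here. As a loose illustration of the stated observation only (actions depend on object-hand-task relations rather than on appearance), the sketch below builds a tiny three-node structure and derives pairwise offsets from it. Everything in it is an assumption made for illustration: the `Node` class, the `relative_offsets` helper, the use of 3D positions as node features, and the example scenes are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from itertools import combinations

# Hypothetical illustration, NOT the paper's method: a three-node
# (object, hand, task) structure whose edge features ignore appearance.

@dataclass(frozen=True)
class Node:
    role: str          # "object", "hand", or "task"
    position: tuple    # assumed 3D position; the task node uses its goal location
    appearance: str    # colour/texture label, deliberately unused below


def relative_offsets(nodes):
    """Return pairwise position offsets keyed by (role_a, role_b)."""
    by_role = {n.role: n for n in nodes}
    edges = {}
    for a, b in combinations(("object", "hand", "task"), 2):
        pa, pb = by_role[a].position, by_role[b].position
        # Offset from node a to node b, computed per coordinate.
        edges[(a, b)] = tuple(y - x for x, y in zip(pa, pb))
    return edges


# Two scenes with the same geometry but different appearance.
scene_a = [
    Node("object", (1, 0, 0), "red mug"),
    Node("hand", (0, 0, 2), "silver gripper"),
    Node("task", (3, 1, 0), "wooden shelf"),
]
scene_b = [
    Node("object", (1, 0, 0), "blue bottle"),
    Node("hand", (0, 0, 2), "black gripper"),
    Node("task", (3, 1, 0), "metal rack"),
]

edges_a = relative_offsets(scene_a)
edges_b = relative_offsets(scene_b)
```

In this toy setup `edges_a` and `edges_b` are identical even though every appearance label differs, which is the kind of invariance the abstract argues appearance-entangled representations lack. A real system would have to estimate such relations from images and language rather than receive them directly.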
