Scene-agnostic Hierarchical Bimanual Task Planning via Visual Affordance Reasoning
Kwang Bin Lee, Jiho Kang, Sung-Hee Lee

TL;DR
This paper introduces a unified framework enabling embodied agents to plan and execute coordinated two-handed actions in unseen cluttered environments by reasoning about scene affordances and spatial relationships.
Contribution
The paper presents a scene-agnostic bimanual task planning system that integrates visual grounding, subgoal reasoning, and structured prompting for coordinated manipulation.
Findings
Produces coherent, feasible two-handed plans in cluttered scenes
Generalizes to unseen environments without retraining
Demonstrates robust scene-agnostic affordance reasoning
Abstract
Embodied agents operating in open environments must translate high-level instructions into grounded, executable behaviors, often requiring coordinated use of both hands. While recent foundation models offer strong semantic reasoning, existing robotic task planners remain predominantly unimanual and fail to address the spatial, geometric, and coordination challenges inherent to bimanual manipulation in scene-agnostic settings. We present a unified framework for scene-agnostic bimanual task planning that bridges high-level reasoning with 3D-grounded two-handed execution. Our approach integrates three key modules. Visual Point Grounding (VPG) analyzes a single scene image to detect relevant objects and generate world-aligned interaction points. Bimanual Subgoal Planner (BSP) reasons over spatial adjacency and cross-object accessibility to produce compact, motion-neutralized subgoals that…
Taxonomy
Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Robotic Path Planning Algorithms
