ReSemAct: Advancing Fine-Grained Robotic Manipulation via Semantic Structuring and Affordance Refinement
Chenyu Su, Weiwei Shang, Chen Qian, Fei Zhang, Shuang Cong

TL;DR
ReSemAct is a unified framework that combines semantic structuring with affordance refinement, driven by multimodal large language models and vision foundation models, so that robots can perform fine-grained manipulation more robustly in dynamic, real-world environments.
Contribution
The paper presents ReSemAct, a novel framework that combines semantic structuring and affordance refinement using foundation models for improved robotic manipulation.
Findings
ReSemAct achieves robust zero-shot manipulation in complex environments.
Semantic structuring improves the accuracy of affordance detection.
Refinement strategies enhance manipulation precision and adaptability.
Abstract
Fine-grained robotic manipulation requires grounding natural language into appropriate affordance targets. However, most existing methods driven by foundation models often compress rich semantics into oversimplified affordances, preventing exploitation of implicit semantic information. To address these challenges, we present ReSemAct, a novel unified manipulation framework that introduces Semantic Structuring and Affordance Refinement (SSAR), powered by the automated synergistic reasoning between Multimodal Large Language Models (MLLMs) and Vision Foundation Models (VFMs). Specifically, the Semantic Structuring module derives a unified semantic affordance description from natural language and RGB observations, organizing affordance regions, implicit functional intent, and coarse affordance anchors into a structured representation for downstream refinement. Building upon this…
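The abstract names three components of the structured representation: affordance regions, implicit functional intent, and coarse affordance anchors. As a rough illustration only, such a record might be sketched as below. The class name, field names, field types, and the bounds-check helper are all hypothetical and are not taken from the paper; in particular, representing an anchor as a single pixel coordinate is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SemanticAffordance:
    """Hypothetical container for one structured affordance description."""
    region: str                     # affordance region, e.g. "mug handle" (assumed to be text)
    functional_intent: str          # implicit functional intent, e.g. "grasp to pour"
    coarse_anchor: Tuple[int, int]  # coarse anchor as a pixel (u, v); an assumption


def anchors_in_image(items: List[SemanticAffordance],
                     width: int, height: int) -> List[SemanticAffordance]:
    """Keep only descriptions whose coarse anchor lies inside the image bounds."""
    return [
        a for a in items
        if 0 <= a.coarse_anchor[0] < width and 0 <= a.coarse_anchor[1] < height
    ]


# Illustrative values only; these are not results or data from the paper.
candidates = [
    SemanticAffordance("mug handle", "grasp to pour", (320, 240)),
    SemanticAffordance("drawer knob", "pull to open", (900, 100)),
]
valid = anchors_in_image(candidates, width=640, height=480)
```

A record of this kind would give a downstream refinement step one object to read from, rather than separate language and vision outputs. How the paper actually encodes or refines it is not stated in the truncated abstract.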
Taxonomy
Topics: Robot Manipulation and Learning · Robotic Path Planning Algorithms · Multimodal Machine Learning Applications
