MaP-AVR: A Meta-Action Planner for Agents Leveraging Vision Language Models and Retrieval-Augmented Generation

Zhenglong Guo; Yiming Zhao; Feng Jiang; Heng Jin; Zongbao Feng; Jianbin Zhou; Siyuan Xu

arXiv:2512.19453·cs.RO·December 23, 2025

TL;DR

MaP-AVR introduces a novel meta-action abstraction and leverages retrieval-augmented generation to improve robotic task planning with vision-language models, enhancing generalization and adaptability in complex environments.

Contribution

The paper contributes a meta-action abstraction for robotic task planning and integrates retrieval-augmented generation (RAG) into the planner to improve task execution.
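This page does not describe how retrieval is wired into the planner, so the following is only a generic sketch of retrieval-augmented prompting, not the authors' pipeline. It assumes a hypothetical in-memory store of (task, plan) pairs and uses simple word-overlap (Jaccard) similarity in place of whatever retriever the paper actually uses; the names `jaccard`, `retrieve`, and `build_prompt` are illustrative.

```python
from typing import List, Tuple

# Hypothetical memory entry: (task description, previously successful plan text).
MemoryItem = Tuple[str, str]


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two strings (a stand-in for a real retriever)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def retrieve(task: str, memory: List[MemoryItem], k: int = 2) -> List[MemoryItem]:
    """Return the k stored examples whose task text is most similar to the new task."""
    ranked = sorted(memory, key=lambda item: jaccard(task, item[0]), reverse=True)
    return ranked[:k]


def build_prompt(task: str, memory: List[MemoryItem], k: int = 2) -> str:
    """Prepend retrieved examples to the planning request (the 'augmented' part of RAG)."""
    lines = ["Plan the task as a sequence of meta-actions.", ""]
    for past_task, past_plan in retrieve(task, memory, k):
        lines.append(f"Example task: {past_task}")
        lines.append(f"Example plan: {past_plan}")
        lines.append("")
    lines.append(f"Task: {task}")
    lines.append("Plan:")
    return "\n".join(lines)
```

The resulting prompt string would then be sent to a vision-language model together with the scene image; that call is omitted here because the page gives no detail about it.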

Findings

01. Demonstrates effective task completion with GPT-4o and OmniGibson.

02. Shows improved generalization over existing methods.

03. Validates the approach's promising performance.

Abstract

Embodied robotic AI systems designed to manage complex daily tasks rely on a task planner to understand and decompose high-level tasks. While most research focuses on enhancing the task-understanding abilities of LLMs/VLMs through fine-tuning or chain-of-thought prompting, this paper argues that defining the planned skill set is equally crucial. To handle the complexity of daily environments, the skill set should possess a high degree of generalization ability. Empirically, more abstract expressions tend to be more generalizable. Therefore, we propose to abstract the planned result as a set of meta-actions. Each meta-action comprises three components: {move/rotate, end-effector status change, relationship with the environment}. This abstraction replaces human-centric concepts, such as grasping or pushing, with the robot's intrinsic functionalities. As a result, the planned outcomes…
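The three-part meta-action described in the abstract can be pictured as a small record type. The sketch below is an illustration under stated assumptions, not the paper's representation: the field names, the `open`/`close`/`none` end-effector values, the free-text `relation` field, and the example plan are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Motion(Enum):
    """First component: the robot either moves or rotates."""
    MOVE = "move"
    ROTATE = "rotate"


class EndEffectorChange(Enum):
    """Second component: change in end-effector status (values are assumed)."""
    NONE = "none"
    OPEN = "open"
    CLOSE = "close"


@dataclass(frozen=True)
class MetaAction:
    """One planned step: {move/rotate, end-effector status change, relationship with the environment}."""
    motion: Motion
    end_effector: EndEffectorChange
    relation: str  # assumed free-text description of the relationship with the environment

    def describe(self) -> str:
        return (
            f"{self.motion.value} | end-effector: {self.end_effector.value} "
            f"| relation: {self.relation}"
        )


def pick_up_plan(obj: str) -> List[MetaAction]:
    """Hypothetical example: a human-level 'grasp' expressed only with meta-actions."""
    return [
        MetaAction(Motion.MOVE, EndEffectorChange.OPEN, f"approaching {obj}"),
        MetaAction(Motion.MOVE, EndEffectorChange.CLOSE, f"in contact with {obj}"),
        MetaAction(Motion.MOVE, EndEffectorChange.NONE, f"holding {obj} above the surface"),
    ]
```

Calling `pick_up_plan("cup")` yields three steps in which the word "grasp" never appears, which is the point of the abstraction: the plan is stated in terms of what the robot can do rather than in human-centric skill names.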

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Reinforcement Learning in Robotics