Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents

Sangwon Baik; Gunhee Kim; Mingi Choi; Hanbyul Joo

arXiv:2604.09781 · cs.CV · April 14, 2026

TL;DR

This paper introduces a closed-loop, inference-time approach using vision-language models for accurate text-guided 6D object pose rearrangement in 3D scenes, enhancing robotic manipulation capabilities.

Contribution

The paper presents an inference-time technique that combines multi-view reasoning, object-centered visualization, and single-axis rotation prediction to improve text-guided 6D goal-pose prediction without additional training.
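As a rough illustration of what a single-axis rotation update could look like, the sketch below rotates an object's orientation about one axis at a time instead of predicting a full 3D rotation in one step. The helper names (`axis_rotation`, `apply_single_axis_update`), the choice of world axes, and the degree-based interface are assumptions made for this example; the summary above does not specify how the paper parameterizes the update.

```python
import math

Matrix3 = list[list[float]]


def axis_rotation(axis: str, degrees: float) -> Matrix3:
    """Return a 3x3 rotation matrix about a single world axis ('x', 'y', or 'z')."""
    c = math.cos(math.radians(degrees))
    s = math.sin(math.radians(degrees))
    if axis == "x":
        return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]
    if axis == "y":
        return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
    if axis == "z":
        return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    raise ValueError(f"unknown axis: {axis!r}")


def matmul3(a: Matrix3, b: Matrix3) -> Matrix3:
    """Multiply two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def apply_single_axis_update(rotation: Matrix3, axis: str, degrees: float) -> Matrix3:
    """Compose a single-axis rotation onto the current orientation.

    Assumption: the update is applied about a world axis passing through the
    object's own origin, so only the rotation changes and the translation
    part of the 6D pose is left untouched.
    """
    return matmul3(axis_rotation(axis, degrees), rotation)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rotated = apply_single_axis_update(identity, "z", 90.0)
# The object's local x-axis (first column) now points along world +y,
# up to floating-point error.
```

Restricting each step to one axis and one angle keeps the quantity being predicted small, which is one plausible reason such a parameterization might be easier for a VLM to reason about than a full rotation.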

Findings

01. Outperforms prior methods in text-guided 6D pose prediction.

02. Works effectively across different VLMs, both open-source and closed-source.

03. Enables more successful robot manipulation when integrated with motion planning.

Abstract

Vision-Language Models (VLMs) exhibit strong visual reasoning capabilities, yet they still struggle with 3D understanding. In particular, VLMs often fail to infer a text-consistent goal 6D pose of a target object in a 3D scene. However, we find that with some inference-time techniques and iterative reasoning, VLMs can achieve dramatic performance gains. Concretely, given a 3D scene represented by an RGB-D image (or a compositional scene of 3D meshes) and a text instruction specifying a desired state change, we repeat the following loop: observe the current scene; evaluate whether it is faithful to the instruction; propose a pose update for the target object; apply the update; and render the updated scene. Through this closed-loop interaction, the VLM effectively acts as an agent. We further introduce three inference-time techniques that are essential to this closed-loop process: (i)…
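The loop described in the abstract (observe, evaluate, propose, apply, render) can be sketched as a small driver function. This is a minimal sketch under stated assumptions, not the authors' implementation: `Pose`, `closed_loop`, `render`, and `vlm_step` are hypothetical names, the pose is reduced to a position plus one yaw angle, and the VLM is replaced by a toy rule so the example runs without any model or renderer.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class Pose:
    """Toy stand-in for a 6D pose: position plus a single yaw angle."""
    x: float
    y: float
    z: float
    yaw_deg: float = 0.0


# A proposal maps the current pose to an updated pose.
# None means the scene was judged faithful to the instruction.
Proposal = Optional[Callable[[Pose], Pose]]


def closed_loop(pose: Pose,
                instruction: str,
                render: Callable[[Pose], object],
                vlm_step: Callable[[object, str], Proposal],
                max_iters: int = 10) -> tuple[Pose, list[Pose]]:
    """Repeat observe -> evaluate -> propose -> apply until faithful or out of budget."""
    history = [pose]
    for _ in range(max_iters):
        observation = render(pose)                      # observe the current scene
        proposal = vlm_step(observation, instruction)   # evaluate and propose an update
        if proposal is None:                            # judged faithful: stop
            break
        pose = proposal(pose)                           # apply the update
        history.append(pose)                            # next iteration re-renders
    return pose, history


def toy_render(pose: Pose) -> Pose:
    """Hypothetical renderer: simply exposes the pose as the observation."""
    return pose


def toy_vlm_step(observation: Pose, instruction: str) -> Proposal:
    """Hypothetical stand-in for a VLM call: raise the object until z >= 0.5."""
    if observation.z >= 0.5:
        return None
    return lambda p: replace(p, z=p.z + 0.25)


final, history = closed_loop(Pose(0.0, 0.0, 0.0),
                             "lift the mug to a height of at least 0.5",
                             toy_render, toy_vlm_step)
```

In a real system, `render` would produce images of the updated scene and `vlm_step` would wrap a VLM query; the `max_iters` cap is an added safeguard for this sketch so the loop terminates even if the evaluator never reports success.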
