Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes

Jing Tan; Zhaoyang Zhang; Yantao Shen; Jiarui Cai; Shuo Yang; Jiajun Wu; Wei Xia; Zhuowen Tu; Stefano Soatto

arXiv:2601.02356·cs.CV·January 9, 2026

TL;DR

Talk2Move is a reinforcement learning framework that enables precise, natural language-guided object transformations in scenes without requiring paired training data, advancing the capabilities of multimodal scene editing.

Contribution

It introduces a novel RL-based diffusion method with spatial rewards and active learning for text-guided object-level geometric transformations without paired supervision.

Findings

01. Outperforms existing methods in spatial accuracy

02. Achieves coherent and semantically faithful transformations

03. Demonstrates effectiveness on curated benchmarks

Abstract

We introduce Talk2Move, a reinforcement learning (RL) based diffusion framework for text-instructed spatial transformation of objects within scenes. Spatially manipulating objects in a scene through natural language poses a challenge for multimodal generation systems. While existing text-based manipulation methods can adjust appearance or style, they struggle to perform object-level geometric transformations (such as translating, rotating, or resizing objects) due to scarce paired supervision and pixel-level optimization limits. Talk2Move employs Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts generated from input images and lightweight textual variations, removing the need for costly paired data. A spatial-reward-guided model aligns geometric transformations with linguistic description, while off-policy step evaluation and active step…
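
To make the GRPO step concrete, the sketch below shows the group-relative advantage normalization that GRPO is generally described as using: each rollout's reward is compared against the mean and standard deviation of its own group, so no separate value network is needed. This is an illustration, not the authors' implementation. The function name `group_relative_advantages`, the `eps` constant, and the reward values are assumptions made for the example; the abstract does not state how the spatial reward is computed.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group.

    `rewards` holds one scalar per rollout sampled for the same
    (image, instruction) pair. `eps` (an assumed constant) guards
    against division by zero when every rollout scores the same.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical spatial rewards for four rollouts of one edit instruction.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Rollouts that score above the group mean receive a positive advantage and are reinforced; those below the mean are discouraged. Because the signal is relative within a group, it depends only on a scalar reward per rollout rather than on paired before/after images.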

Taxonomy

Topics: 3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications