Stimulating Imagination: Towards General-purpose "Something Something Placement"

Jianyang Wu; Jie Gu; Xiaokang Ma; Fangzhou Qiu; Chu Tang; Jingmin Chen

arXiv:2408.01655 · cs.RO · July 22, 2025

TL;DR

This paper introduces SPORT, a method for general-purpose object placement that combines a pre-trained large vision model with a diffusion-based pose estimator, so that robots can rearrange objects from "something something" instructions without task-specific fine-tuning of the vision model.

Contribution

SPORT combines vision reasoning and diffusion models for 3D object placement, reducing training needs and enabling zero-shot generalization to unseen objects.

Findings

01 Effective in generating 3D goal poses for unseen objects

02 Seamless transfer from simulation to real-world environments

03 No fine-tuning required for different object types

Abstract

General-purpose object placement is a fundamental capability of an intelligent generalist robot: the ability to rearrange objects following human instructions even in novel environments. This work is dedicated to achieving general-purpose object placement with "something something" instructions. Specifically, we break the entire process down into three parts, namely object localization, goal imagination and robot control, and propose a method named SPORT. SPORT leverages a pre-trained large vision model for broad semantic reasoning about objects, and learns a diffusion-based pose estimator to ensure physically realistic results in 3D space. Only object types (movable or reference) are communicated between these two parts, which brings two benefits. One is that we can fully leverage the powerful ability of open-set object recognition and localization since no specific…
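
The three-part split described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name below (`localize`, `imagine_goal`, `execute`, `LocalizedObject`, `GoalPose`) is hypothetical, the instruction parser is a placeholder for the vision model, and a simple geometric stacking rule stands in for the learned diffusion-based pose estimator. The only point it tries to show is the interface: the goal-imagination stage receives geometry plus an object type (movable or reference), and no object names or other semantics.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


class ObjectType(Enum):
    MOVABLE = "movable"
    REFERENCE = "reference"


@dataclass
class LocalizedObject:
    points: List[Point]    # segmented 3D points of one object
    obj_type: ObjectType   # the only label passed to goal imagination


@dataclass
class GoalPose:
    translation: Point     # rotation omitted to keep the sketch short


def centroid(points: List[Point]) -> Point:
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def localize(instruction: str, scene: Dict[str, List[Point]]) -> List[LocalizedObject]:
    """Stage 1 (assumed stand-in for the vision model).

    Toy parser for instructions shaped like "put <movable> on <reference>".
    Object names are consumed here and never leave this stage.
    """
    words = instruction.split()
    movable, reference = words[1], words[3]
    return [
        LocalizedObject(scene[movable], ObjectType.MOVABLE),
        LocalizedObject(scene[reference], ObjectType.REFERENCE),
    ]


def imagine_goal(objects: List[LocalizedObject]) -> GoalPose:
    """Stage 2 (assumed stand-in for the diffusion-based estimator).

    Uses only point clouds and object types: align the movable object's
    centroid with the reference in x/y and rest its bottom on the
    reference's top surface.
    """
    movable = next(o for o in objects if o.obj_type is ObjectType.MOVABLE)
    reference = next(o for o in objects if o.obj_type is ObjectType.REFERENCE)
    mx, my, _ = centroid(movable.points)
    rx, ry, _ = centroid(reference.points)
    ref_top = max(p[2] for p in reference.points)
    mov_bottom = min(p[2] for p in movable.points)
    return GoalPose((rx - mx, ry - my, ref_top - mov_bottom))


def execute(obj: LocalizedObject, goal: GoalPose) -> List[Point]:
    """Stage 3 (assumed stand-in for robot control): apply the translation."""
    dx, dy, dz = goal.translation
    return [(x + dx, y + dy, z + dz) for x, y, z in obj.points]


if __name__ == "__main__":
    scene = {
        "mug": [(1.0, 0.0, 0.0), (1.0, 0.0, 0.1)],
        "plate": [(0.0, 0.0, 0.0), (0.2, 0.0, 0.02)],
    }
    objects = localize("put mug on plate", scene)
    goal = imagine_goal(objects)
    print(goal.translation, execute(objects[0], goal))
```

Because `imagine_goal` never sees the strings "mug" or "plate", the localization stage could be swapped for any open-set detector without touching the pose stage, which is the decoupling the abstract attributes to passing only object types.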

Taxonomy

Topics: Reinforcement Learning in Robotics · Image Processing and 3D Reconstruction · Artificial Intelligence in Games

Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Layer Normalization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings