Stimulating Imagination: Towards General-purpose "Something Something Placement"
Jianyang Wu, Jie Gu, Xiaokang Ma, Fangzhou Qiu, Chu Tang, Jingmin Chen

TL;DR
This paper introduces SPORT, a method for general-purpose object placement that leverages large vision models and diffusion-based pose estimation to enable robots to rearrange objects based on vague instructions, with minimal training.
Contribution
SPORT combines vision reasoning and diffusion models for 3D object placement, reducing training needs and enabling zero-shot generalization to unseen objects.
Findings
Effective in generating 3D goal poses for unseen objects
Seamless transfer from simulation to real-world environments
No fine-tuning required for different object types
Abstract
General-purpose object placement is a fundamental capability of an intelligent generalist robot: being capable of rearranging objects following human instructions even in novel environments. This work is dedicated to achieving general-purpose object placement with "something something" instructions. Specifically, we break the entire process down into three parts (object localization, goal imagination, and robot control) and propose a method named SPORT. SPORT leverages a pre-trained large vision model for broad semantic reasoning about objects, and learns a diffusion-based pose estimator to ensure physically realistic results in 3D space. Only object types (movable or reference) are communicated between these two parts, which brings two benefits. One is that we can fully leverage the powerful ability of open-set object recognition and localization since no specific…
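The three-part decomposition described in the abstract can be illustrated with a minimal sketch. Everything below is hypothetical and not taken from the paper: the names `localize_objects`, `imagine_goal_pose`, and `plan_motion`, the input format, and the data classes are assumptions made for illustration. In particular, `imagine_goal_pose` is a simple geometric stand-in (it stacks the movable object on top of the reference object), not the diffusion-based pose estimator that SPORT learns. The sketch only shows the interface idea that object types, rather than object names, cross the boundary between localization and goal imagination.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple

Point = Tuple[float, float, float]


class ObjectType(Enum):
    MOVABLE = "movable"
    REFERENCE = "reference"


@dataclass
class LocalizedObject:
    points: List[Point]        # segmented 3D points of the object
    object_type: ObjectType    # the only semantic label passed downstream


@dataclass
class Pose:
    translation: Point
    rotation_quat: Tuple[float, float, float, float]  # (w, x, y, z)


def centroid(points: List[Point]) -> Point:
    n = len(points)
    return (sum(p[0] for p in points) / n,
            sum(p[1] for p in points) / n,
            sum(p[2] for p in points) / n)


def localize_objects(detections) -> List[LocalizedObject]:
    """Stage 1 (assumed input): (name, role, points) triples, e.g. from a vision model.

    The object name is dropped here; only the movable/reference type survives.
    """
    return [LocalizedObject(points, ObjectType(role)) for _name, role, points in detections]


def imagine_goal_pose(objects: List[LocalizedObject]) -> Pose:
    """Stage 2 stand-in: a geometric placeholder, NOT a diffusion model."""
    movable = next(o for o in objects if o.object_type is ObjectType.MOVABLE)
    reference = next(o for o in objects if o.object_type is ObjectType.REFERENCE)
    ref_x, ref_y, _ = centroid(reference.points)
    ref_top = max(p[2] for p in reference.points)
    zs = [p[2] for p in movable.points]
    half_height = (max(zs) - min(zs)) / 2
    # Place the movable object's center just above the reference object's top.
    return Pose((ref_x, ref_y, ref_top + half_height), (1.0, 0.0, 0.0, 0.0))


def plan_motion(start: Point, goal: Pose, clearance: float = 0.2) -> List[Point]:
    """Stage 3 stand-in: lift, translate, lower (straight-line waypoints)."""
    gx, gy, gz = goal.translation
    top = max(start[2], gz) + clearance
    return [start, (start[0], start[1], top), (gx, gy, top), (gx, gy, gz)]


if __name__ == "__main__":
    detections = [
        ("mug", "movable", [(0.0, 0.0, 0.0), (0.0, 0.0, 0.1)]),
        ("plate", "reference", [(1.0, 1.0, 0.0), (1.0, 1.0, 0.02)]),
    ]
    objects = localize_objects(detections)
    goal = imagine_goal_pose(objects)
    mug_center = centroid(objects[0].points)
    print(plan_motion(mug_center, goal))
```

Because the second stage in this sketch never sees the strings "mug" or "plate", it could in principle be swapped for any pose estimator that consumes typed point sets, which is the kind of decoupling the abstract describes.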
Taxonomy
Topics: Reinforcement Learning in Robotics · Image Processing and 3D Reconstruction · Artificial Intelligence in Games
Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Layer Normalization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings
