Copy-Transform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints
Rotem Gatenyo, Ohad Fried

TL;DR
This paper introduces a zero-shot method for aligning 3D meshes based on language prompts, using differentiable rendering and geometric constraints, eliminating the need for training new models.
Contribution
It presents a novel optimization-based framework that combines vision-language cues with geometric constraints for 3D alignment without training, outperforming existing methods.
Findings
Outperforms baseline methods in alignment accuracy
Produces semantically faithful and physically plausible results
Curated a diverse benchmark for evaluation
Abstract
We study zero-shot 3D alignment of two given meshes, using a text prompt describing their spatial relation -- an essential capability for content creation and scene assembly. Earlier approaches primarily rely on geometric alignment procedures, while recent work leverages pretrained 2D diffusion models to model language-conditioned object-object spatial relationships. In contrast, we directly optimize the relative pose at test time, updating translation, rotation, and isotropic scale with CLIP-driven gradients via a differentiable renderer, without training a new model. Our framework augments language supervision with geometry-aware objectives: a variant of the soft Iterative Closest Point (ICP) term to encourage surface attachment and a penetration loss to discourage interpenetration. A phased schedule strengthens contact constraints over time, and camera control concentrates the…
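The geometry-aware part of the objective described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the softmin temperature, the linear ramp, and the use of a sphere as a stand-in for the target object are all assumptions made for illustration, and the CLIP-driven term is represented only by a placeholder argument.

```python
import numpy as np

def soft_attachment_loss(src, tgt, temperature=0.05):
    """Soft-ICP-style term (assumed form): softmin-weighted squared
    distance from each source point to all target points."""
    d2 = ((src[:, None, :] - tgt[None, :, :]) ** 2).sum(-1)  # (N, M)
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    return float((w * d2).sum(axis=1).mean())

def penetration_loss(src, center, radius):
    """Penalty on source points inside a sphere that stands in for the
    target object (hypothetical simplification of a mesh-based test)."""
    depth = radius - np.linalg.norm(src - center, axis=1)
    return float((np.maximum(depth, 0.0) ** 2).mean())

def contact_weight(step, total_steps, w_max=1.0):
    """Phased schedule (assumed linear): contact terms ramp from 0 to w_max."""
    return w_max * min(1.0, step / max(1, total_steps - 1))

def total_loss(semantic_loss, src, tgt, center, radius, step, total_steps):
    """Combine a placeholder language-driven loss with the contact terms."""
    w = contact_weight(step, total_steps)
    contact = soft_attachment_loss(src, tgt) + penetration_loss(src, center, radius)
    return semantic_loss + w * contact
```

In a full pipeline, `semantic_loss` would come from a differentiable renderer scored against the prompt, and the pose parameters (translation, rotation, isotropic scale) would be updated by gradient descent. Here the functions only evaluate the loss for fixed point sets.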
Taxonomy
Topics: 3D Shape Modeling and Analysis · Interactive and Immersive Displays · Robot Manipulation and Learning
