Ctrl&Shift: High-Quality Geometry-Aware Object Manipulation in Visual Generation
Penghui Ruan, Bojia Zi, Xianbiao Qi, Youze Huang, Rong Xiao, Pichao Wang, Jiannong Cao, Yuhui Shi

TL;DR
Ctrl&Shift is a diffusion-based framework for high-quality, geometry-aware object manipulation in images and videos. It preserves scene realism, keeps viewpoints consistent, and gives users control over the transformation, all without explicit 3D models.
Contribution
The paper introduces a novel end-to-end diffusion approach that unifies geometric control and real-world generalization for object manipulation without explicit 3D reconstruction.
Findings
Achieves state-of-the-art fidelity and viewpoint consistency.
Demonstrates superior controllability over existing methods.
Introduces a scalable dataset construction pipeline for real-world data.
Abstract
Object-level manipulation, relocating or reorienting objects in images or videos while preserving scene realism, is central to film post-production, AR, and creative editing. Yet existing methods struggle to jointly achieve three core goals: background preservation, geometric consistency under viewpoint shifts, and user-controllable transformations. Geometry-based approaches offer precise control but require explicit 3D reconstruction and generalize poorly; diffusion-based methods generalize better but lack fine-grained geometric control. We present Ctrl&Shift, an end-to-end diffusion framework to achieve geometry-consistent object manipulation without explicit 3D representations. Our key insight is to decompose manipulation into two stages, object removal and reference-guided inpainting under explicit camera pose control, and encode both within a unified diffusion process. To enable…
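The abstract refers to inpainting "under explicit camera pose control" but this summary does not say how the pose is encoded. As a rough illustration only, the sketch below shows one common way to express a relative pose between a source and a target placement of an object as a rotation plus a translation. The function names (`rot_y`, `relative_pose`) and the yaw-only rotation are assumptions made for this example, not the paper's parameterization.

```python
import math

def rot_y(deg):
    """3x3 rotation about the y-axis by `deg` degrees (assumed convention)."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def matmul(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(A, v):
    """Product of a 3x3 matrix and a length-3 vector."""
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

def transpose(A):
    """Transpose of a 3x3 matrix (the inverse, for a rotation)."""
    return [list(row) for row in zip(*A)]

def relative_pose(R_src, t_src, R_tgt, t_tgt):
    """Return (R_rel, t_rel) with T_rel = T_tgt * inverse(T_src).

    Applying (R_rel, t_rel) after the source pose reproduces the target pose:
    R_rel @ R_src == R_tgt and R_rel @ t_src + t_rel == t_tgt.
    """
    R_rel = matmul(R_tgt, transpose(R_src))
    rotated = matvec(R_rel, t_src)
    t_rel = [t_tgt[i] - rotated[i] for i in range(3)]
    return R_rel, t_rel

# Hypothetical edit: rotate the object by 30 degrees and shift it 0.5 units in x.
R_rel, t_rel = relative_pose(rot_y(0.0), [0.0, 0.0, 2.0],
                             rot_y(30.0), [0.5, 0.0, 2.0])
yaw_rel = math.degrees(math.atan2(R_rel[0][2], R_rel[0][0]))
```

A relative transform like this could then be flattened into a conditioning vector. Whether Ctrl&Shift uses matrices, Euler angles, or another encoding is not stated here.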
Peer Reviews
Decision: ICLR 2026 Poster
1. Innovative decomposition of object manipulation into removal and pose-controlled inpainting within a unified diffusion model, enabling fine-grained geometric control (e.g., precise relocation and rotation) without relying on explicit 3D representations like NeRF or Gaussians, which improves scalability and avoids per-scene optimization.
2. The multi-task training approach effectively disentangles conditioning signals (background, object identity, camera pose), leading to interpretable and con
Heavy Reliance on Data Synthesis Pipeline: The method's performance is heavily dependent on the quality of the synthesized training data generated by the multi-step pipeline (mesh reconstruction → pose estimation → harmonization). Errors or artifacts introduced at any of these stages could propagate into the final model, potentially limiting its performance on objects or scenes that are challenging for these upstream components (e.g., transparent objects, complex textures, artistic images) and a
- Empirical Results and High-Quality Output. The quantitative results are compelling. On the new GeoEditBench (Table 2), Ctrl&Shift shows substantial improvements in geometric accuracy (17.70% Pose MAPE vs. 24.36% for the next best) while also achieving the highest fidelity scores (PSNR, DreamSim). The qualitative comparisons (Figure 4) are impressive, clearly demonstrating the model's superiority in handling complex rotations and perspective shifts where competitors fail.
- Significant Enablin
Several aspects require clarification or further investigation. 1. Reliance on 3D Supervision during Training and Generalization Limits. The paper emphasizes avoiding explicit 3D representations at inference. However, the training data generation (Section 2.5) heavily relies on explicit 3D reconstruction (Hunyuan3D-2) and differentiable rendering. The model's generalization is therefore constrained by the capabilities of the underlying 3D reconstruction method. Furthermore, the rigorous filteri
1. Conceptual Innovation: The paper's primary strength lies in its core idea. It represents a conceptual shift: instead of relying on expensive or unstable 3D reconstruction (like NeRF or Mesh) at inference time, it injects precise geometric control (a relative pose vector) as a condition into the 2D diffusion process. This is a very clever decoupling that elegantly combines the advantages of both domains. Systematic Framework Design: The Ctrl&Shift architecture is designed with systematic and
I agree with the authors that this is excellent and inspiring work. To make the paper more complete and rigorous, I strongly recommend the authors add a 'Limitations and Future Work' discussion to the final version (e.g., in the conclusion or appendix). I would like the authors to specifically address the following points: From Technical Controllability to Practical Usability: The authors should discuss the challenge of mapping intuitive user interactions (e.g., 2D mouse drags, rotational gest
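The reviews cite Pose MAPE figures (17.70% vs. 24.36%) without defining the metric. A minimal sketch of the standard mean absolute percentage error follows, assuming it is applied to scalar pose parameters such as rotation angles. The paper's exact definition of Pose MAPE may differ, and the function name, the `eps` guard, and the sample values are illustrative assumptions.

```python
def mape(predicted, target, eps=1e-8):
    """Mean absolute percentage error, in percent.

    `eps` is an assumed guard against division by zero when a target is 0.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    if not target:
        raise ValueError("inputs must be non-empty")
    errors = [abs(p - t) / max(abs(t), eps) for p, t in zip(predicted, target)]
    return 100.0 * sum(errors) / len(errors)

# Hypothetical rotation angles in degrees: each prediction is off by 10%.
target_angles = [30.0, 60.0, 90.0]
predicted_angles = [33.0, 54.0, 99.0]
score = mape(predicted_angles, target_angles)  # roughly 10 percent
```

Lower is better under this definition, which matches the reviews' reading of 17.70% as an improvement over 24.36%.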
Taxonomy
Topics: 3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning
