Point2Insert: Video Object Insertion via Sparse Point Guidance
Yu Zhou, Xiaoyan Yang, Bojia Zi, Lihan Zhang, Ruijie Sun, Weishi Zheng, Haibin Huang, Chi Zhang, Xuelong Li

TL;DR
Point2Insert is a framework for precise video object insertion guided by a few sparse points. It reduces annotation effort relative to mask-based methods and gives finer control over placement than instruction-based methods.
Contribution
It introduces a sparse-point-based insertion model that requires minimal annotations and employs a two-stage training process with distillation from mask-guided models.
Findings
Outperforms strong baselines in object insertion tasks.
Achieves higher insertion success rate with fewer annotations.
Surpasses larger models in performance.
Abstract
This paper introduces Point2Insert, a sparse-point-based framework for flexible and user-friendly object insertion in videos, motivated by the growing demand for accurate, low-effort object placement. Existing approaches face two major challenges: mask-based insertion methods require labor-intensive mask annotations, while instruction-based methods struggle to place objects at precise locations. Point2Insert addresses both issues by requiring only a small number of sparse points instead of dense masks, which removes the need for tedious mask drawing. Specifically, it supports positive and negative points that mark regions as suitable or unsuitable for insertion, enabling fine-grained spatial control over object locations. The training of Point2Insert consists of two stages. In Stage 1, we train an insertion model that generates objects in given regions conditioned on…
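The abstract does not say how positive and negative points are encoded for the model. As a rough illustration only, one common way to turn sparse clicks into a spatial condition is to rasterize them into two binary maps, one per label. The sketch below assumes that encoding; the function name `rasterize_points`, the `(x, y, label)` point format, and the `radius` parameter are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical point format: (x, y, label), where label 1 = positive
# (suitable for insertion) and label 0 = negative (unsuitable).
Point = Tuple[int, int, int]
Grid = List[List[int]]


def rasterize_points(points: List[Point], height: int, width: int,
                     radius: int = 2) -> List[Grid]:
    """Draw each point as a small disk on a per-label binary map.

    Returns [positive_map, negative_map], each a height x width grid of 0/1.
    This is an assumed encoding for illustration, not the paper's method.
    """
    pos = [[0] * width for _ in range(height)]
    neg = [[0] * width for _ in range(height)]
    for x, y, label in points:
        target = pos if label == 1 else neg
        # Visit only the bounding box of the disk, clipped to the frame.
        for yy in range(max(0, y - radius), min(height, y + radius + 1)):
            for xx in range(max(0, x - radius), min(width, x + radius + 1)):
                if (xx - x) ** 2 + (yy - y) ** 2 <= radius ** 2:
                    target[yy][xx] = 1
    return [pos, neg]


# One positive click near the frame center, one negative click in a corner.
maps = rasterize_points([(4, 3, 1), (0, 0, 0)], height=6, width=8, radius=1)
```

In a video setting, maps like these could be built per frame and stacked with the other conditioning inputs; whether Point2Insert does this is not stated in the text above.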
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning · Interactive and Immersive Displays
