Point2Insert: Video Object Insertion via Sparse Point Guidance

Yu Zhou; Xiaoyan Yang; Bojia Zi; Lihan Zhang; Ruijie Sun; Weishi Zheng; Haibin Huang; Chi Zhang; Xuelong Li

arXiv:2602.04167·cs.CV·February 5, 2026

Open Access

TL;DR

Point2Insert is a framework for video object insertion guided by a small number of sparse points. It removes the dense mask annotation that mask-based methods require and gives more precise control over placement than instruction-based methods.

Contribution

The paper introduces a sparse-point-based insertion model that needs only a few point annotations and is trained in two stages, with distillation from mask-guided models.

Findings

01 Outperforms strong baselines in object insertion tasks.

02 Achieves higher insertion success rate with fewer annotations.

03 Surpasses larger models in performance.

Abstract

This paper introduces Point2Insert, a sparse-point-based framework for flexible and user-friendly object insertion in videos, motivated by the growing popularity of accurate, low-effort object placement. Existing approaches face two major challenges: mask-based insertion methods require labor-intensive mask annotations, while instruction-based methods struggle to place objects at precise locations. Point2Insert addresses these issues by requiring only a small number of sparse points instead of dense masks, eliminating the need for tedious mask drawing. Specifically, it supports both positive and negative points to indicate regions that are suitable or unsuitable for insertion, enabling fine-grained spatial control over object locations. The training of Point2Insert consists of two stages. In Stage 1, we train an insertion model that generates objects in given regions conditioned on…
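To make the idea of positive and negative point guidance concrete, here is a minimal sketch of one way sparse points could be turned into a dense conditioning map. Everything in it is an assumption for illustration: the function name `rasterize_points`, the two-channel layout, and the Gaussian blob encoding with width `sigma` are hypothetical and are not taken from the paper, which may encode points quite differently.

```python
import math
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates


def rasterize_points(
    height: int,
    width: int,
    positive: List[Point],
    negative: List[Point],
    sigma: float = 2.0,
) -> List[List[List[float]]]:
    """Return a [2][height][width] map of point guidance.

    Channel 0 holds positive points (suitable for insertion) and
    channel 1 holds negative points (unsuitable). Each point becomes
    a Gaussian blob with peak value 1.0 at its own pixel.
    This encoding is an illustrative assumption, not the paper's design.
    """
    channels = [[[0.0] * width for _ in range(height)] for _ in range(2)]
    for channel, points in zip(channels, (positive, negative)):
        for px, py in points:
            for y in range(height):
                for x in range(width):
                    d2 = (x - px) ** 2 + (y - py) ** 2
                    value = math.exp(-d2 / (2.0 * sigma ** 2))
                    # Keep the strongest response where blobs overlap.
                    if value > channel[y][x]:
                        channel[y][x] = value
    return channels


# One positive click at (x=2, y=3) and one negative click at (x=6, y=6)
# on a hypothetical 8x8 frame.
cond = rasterize_points(8, 8, positive=[(2, 3)], negative=[(6, 6)])
```

In this sketch a user supplies a handful of clicks per frame rather than a full mask, and a downstream model would receive `cond` alongside the video frames. Keeping positive and negative points in separate channels is one simple way to let the model tell "insert here" from "avoid here".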

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning · Interactive and Immersive Displays