X-Paste: Revisiting Scalable Copy-Paste for Instance Segmentation using CLIP and StableDiffusion

Hanqing Zhao; Dianmo Sheng; Jianmin Bao; Dongdong Chen; Dong Chen; Fang Wen; Lu Yuan; Ce Liu; Wenbo Zhou; Qi Chu; Weiming Zhang; Nenghai Yu

arXiv:2212.03863·cs.CV·June 1, 2023·5 cites

TL;DR

X-Paste uses zero-shot recognition and text2image models to generate diverse training instances for scalable Copy-Paste data augmentation, significantly improving instance segmentation performance, especially on long-tail classes.

Contribution

The paper introduces X-Paste, a scalable framework using CLIP and StableDiffusion for generating training data, enabling effective copy-paste augmentation without expensive manual annotations.

Findings

01. Achieves gains of +2.6 box AP and +2.1 mask AP across all classes on LVIS, over a CenterNet2 baseline with a Swin-L backbone.

02. Attains larger gains of +6.8 box AP and +6.5 mask AP on long-tail classes.

03. Demonstrates the feasibility of using text2image and zero-shot recognition models for scalable data augmentation.

Abstract

Copy-Paste is a simple and effective data augmentation strategy for instance segmentation. By randomly pasting object instances onto new background images, it creates new training data for free and significantly boosts the segmentation performance, especially for rare object categories. Although diverse, high-quality object instances used in Copy-Paste result in more performance gain, previous works utilize object instances either from human-annotated instance segmentation datasets or rendered from 3D object models, and both approaches are too expensive to scale up to obtain good diversity. In this paper, we revisit Copy-Paste at scale with the power of newly emerged zero-shot recognition models (e.g., CLIP) and text2image models (e.g., StableDiffusion). We demonstrate for the first time that using a text2image model to generate images or zero-shot recognition model to filter noisily…
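The core Copy-Paste operation described in the abstract, pasting masked object instances at random positions onto a background image, can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the function names `paste_instance` and `random_copy_paste` are hypothetical, and it assumes each instance is already available as an RGB crop with a binary mask and leaves out the scale jittering, blending, and the CLIP and StableDiffusion stages of X-Paste.

```python
import numpy as np


def paste_instance(background, instance_rgb, instance_mask, top, left):
    """Paste one masked instance onto a copy of `background` at (top, left).

    Hypothetical helper for illustration. Returns the composited image and
    the instance's full-size boolean mask. Parts of the instance that fall
    outside the background are clipped.
    """
    H, W = background.shape[:2]
    if not (0 <= top < H and 0 <= left < W):
        raise ValueError("paste position must lie inside the background")

    h, w = instance_mask.shape
    h_fit, w_fit = min(h, H - top), min(w, W - left)  # clip to image bounds

    out = background.copy()
    m = instance_mask[:h_fit, :w_fit].astype(bool)
    region = out[top:top + h_fit, left:left + w_fit]  # view into `out`
    region[m] = instance_rgb[:h_fit, :w_fit][m]       # overwrite masked pixels

    full_mask = np.zeros((H, W), dtype=bool)
    full_mask[top:top + h_fit, left:left + w_fit] = m
    return out, full_mask


def random_copy_paste(background, instances, rng):
    """Paste each (rgb, mask) pair at a uniformly random top-left position.

    Later instances occlude earlier ones, so earlier masks are trimmed to
    keep the returned instance masks mutually exclusive.
    """
    image = background.copy()
    masks = []
    H, W = image.shape[:2]
    for rgb, mask in instances:
        top = int(rng.integers(0, H))
        left = int(rng.integers(0, W))
        image, new_mask = paste_instance(image, rgb, mask, top, left)
        masks = [old & ~new_mask for old in masks]  # handle occlusion
        masks.append(new_mask)
    return image, masks
```

A call such as `random_copy_paste(bg, [(rgb_a, mask_a), (rgb_b, mask_b)], np.random.default_rng(0))` returns the augmented image together with one updated mask per pasted instance, which can then serve as segmentation labels for training.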

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications

Methods: Simple Copy-Paste