Multiple Object Stitching for Unsupervised Representation Learning

Chengchao Shen; Dawei Liu; Jianxin Wang

arXiv:2506.07364·cs.CV·June 10, 2025

TL;DR

This paper introduces Multiple Object Stitching (MOS), a simple unsupervised method that enhances multi-object image representations by stitching single-object images, improving performance on complex downstream tasks without requiring human annotations.

Contribution

MOS is a novel stitching-based approach that refines unsupervised representations for multi-object images, leveraging object correspondences without human labels.

Findings

01

Achieves state-of-the-art unsupervised performance on ImageNet, CIFAR, and COCO.

02

Improves object detection and segmentation tasks.

03

Effective on both single-object and multi-object images.

Abstract

Contrastive learning for single object centric images has achieved remarkable progress on unsupervised representation, but suffers from inferior performance on the widespread images with multiple objects. In this paper, we propose a simple but effective method, Multiple Object Stitching (MOS), to refine the unsupervised representation for multi-object images. Specifically, we construct the multi-object images by stitching the single object centric ones, where the objects in the synthesized multi-object images are predetermined. Hence, compared to the existing contrastive methods, our method provides additional object correspondences between multi-object images without human annotations. In this manner, our method pays more attention to the representations of each object in a multi-object image, thus providing more detailed representations for complicated downstream tasks, such as object…
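To make the stitching idea in the abstract concrete, here is a minimal sketch in NumPy. It is an illustration under stated assumptions, not the authors' implementation: the function names `stitch_grid` and `correspondence_mask` are hypothetical, strided subsampling stands in for whatever resizing and augmentation the method actually uses, and the boolean mask is only one possible way to encode which objects two stitched images share.

```python
import numpy as np


def stitch_grid(images, grid=2):
    """Stitch grid*grid single-object images of shape (H, W, C) into one
    (H, W, C) mosaic. Hypothetical helper, not from the official repo."""
    assert len(images) == grid * grid, "need exactly grid*grid images"
    h, w, _ = images[0].shape
    assert h % grid == 0 and w % grid == 0, "H and W must be divisible by grid"
    # Downscale each image by keeping every grid-th pixel
    # (an assumption; a real pipeline would likely use proper resizing).
    tiles = [img[::grid, ::grid] for img in images]
    # Lay the tiles out row by row, then stack the rows vertically.
    rows = [
        np.concatenate(tiles[r * grid:(r + 1) * grid], axis=1)
        for r in range(grid)
    ]
    return np.concatenate(rows, axis=0)


def correspondence_mask(ids_a, ids_b):
    """Boolean matrix M where M[i, j] is True when tile i of stitched image A
    and tile j of stitched image B come from the same source image.
    The source indices are known by construction, so no labels are needed."""
    a = np.asarray(ids_a)
    b = np.asarray(ids_b)
    return a[:, None] == b[None, :]


# Usage: four constant-valued "images" so each tile is easy to identify.
sources = [np.full((4, 4, 3), k, dtype=np.uint8) for k in range(4)]
mosaic = stitch_grid(sources, grid=2)
mask = correspondence_mask([0, 1, 2, 3], [3, 0, 5, 6])
```

Because each stitched image is built from known source images, the mask comes for free at construction time, which is the sense in which the object correspondences require no human annotation.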

Peer Reviews

Decision·Submitted to ICLR 2024

Reviewer 01 · Rating 5 · marginally below the acceptance threshold · Confidence 5

Strengths

1. The paper achieves state-of-the-art results on multiple benchmarks.
2. The idea is simple but effective.

Weaknesses

1. The motivation is unclear. The paper claims that contrastive learning for single object-centric images "suffer inferior performance on the widespread images with multiple objects", but it doesn't provide enough evidence to support the claim. For example, ImageNet-1K and CIFAR are recognition problems with single objects, why do we need to stitch multiple together? I understand that, for example, even in ImageNet-1K a lot of times images do contain multiple objects. If this is the case, I sugg

Reviewer 02 · Rating 8 · accept, good paper · Confidence 3

Strengths

- The paper presents an innovative approach to unsupervised multi-object representation learning, which is an increasingly important area in computer vision.
- The method's technique for multi-object image stitching through data augmentation, scaling, and tensor operations is efficient and leads to improved representation learning.

Weaknesses

- While the method excels in multi-object representations, it may not have been extensively tested in scenarios with highly dynamic or cluttered objects.

Reviewer 03 · Rating 5 · marginally below the acceptance threshold · Confidence 4

Strengths

- Self-supervised learning, particularly learning from multi-object images, is an important problem.
- The performance improvement is quite significant, both in image classification and object detection.

Weaknesses

**Motivation is not new** The issue of semantic inconsistency in contrastive learning has been discussed in several prior works [1-4]. These works are not properly cited, which may lead readers to overestimate the contribution of this paper. The efforts of prior work and the contributions of this paper should be clarified in the second paragraph of the introduction.
[1] CASTing Your Model: Learning to Localize Improves Self-Supervised Representations. CVPR'21.
[2] Unsupervised Object-Level Re

Code & Models

Repositories

visresearch/MultipleObjectStitching
PyTorch · Official

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Face recognition and analysis · Advanced Neural Network Applications

Methods: Softmax · Attention Is All You Need