Multiple Object Stitching for Unsupervised Representation Learning
Chengchao Shen, Dawei Liu, Jianxin Wang

TL;DR
This paper introduces Multiple Object Stitching (MOS), a simple unsupervised method that enhances multi-object image representations by stitching single-object images, improving performance on complex downstream tasks without requiring human annotations.
Contribution
MOS is a novel stitching-based approach that refines unsupervised representations for multi-object images, leveraging object correspondences without human labels.
Findings
Achieves state-of-the-art unsupervised performance on ImageNet, CIFAR, and COCO.
Improves object detection and segmentation tasks.
Effective on both single-object and multi-object images.
Abstract
Contrastive learning for single-object-centric images has achieved remarkable progress on unsupervised representation, but suffers from inferior performance on the widespread images with multiple objects. In this paper, we propose a simple but effective method, Multiple Object Stitching (MOS), to refine the unsupervised representation for multi-object images. Specifically, we construct the multi-object images by stitching the single-object-centric ones, where the objects in the synthesized multi-object images are predetermined. Hence, compared to the existing contrastive methods, our method provides additional object correspondences between multi-object images without human annotations. In this manner, our method pays more attention to the representations of each object in a multi-object image, thus providing more detailed representations for complicated downstream tasks, such as object…
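The stitching idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the 2 x 2 grid, the strided downscaling, and the names `stitch_2x2` and `shared_objects` are all assumptions chosen for illustration. It shows how a stitched image carries a known layout of source images, so two stitched images that reuse a source image have a known object correspondence without any human label.

```python
import numpy as np


def stitch_2x2(images, ids):
    """Stitch four equally sized H x W x C images into one H x W x C image.

    Hypothetical helper. Each source image is downscaled by a factor of 2
    with plain strided subsampling (an assumption made for brevity) and
    placed in one cell of a 2 x 2 grid. `ids` are source-image indices,
    not class labels, so no annotation is involved.
    """
    assert len(images) == 4 and len(ids) == 4
    small = [img[::2, ::2] for img in images]  # each H/2 x W/2 x C
    top = np.concatenate([small[0], small[1]], axis=1)
    bottom = np.concatenate([small[2], small[3]], axis=1)
    stitched = np.concatenate([top, bottom], axis=0)
    layout = np.array(ids).reshape(2, 2)  # which source sits in which cell
    return stitched, layout


def shared_objects(layout_a, layout_b):
    """Return the source ids that appear in both stitched images."""
    return sorted(set(layout_a.ravel().tolist()) & set(layout_b.ravel().tolist()))


# Six dummy "single-object" images; image i is filled with the value i.
images = [np.full((8, 8, 3), i, dtype=np.uint8) for i in range(6)]

a, layout_a = stitch_2x2([images[i] for i in (0, 1, 2, 3)], [0, 1, 2, 3])
b, layout_b = stitch_2x2([images[i] for i in (2, 3, 4, 5)], [2, 3, 4, 5])

# Sources 2 and 3 appear in both stitched images, so their cells could be
# treated as corresponding objects in a contrastive objective.
shared = shared_objects(layout_a, layout_b)
```

In this sketch the correspondence comes for free from the construction step, which is the property the abstract attributes to synthesized multi-object images.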
Peer Reviews
Decision·Submitted to ICLR 2024
1. The paper achieves state-of-the-art results on multiple benchmarks.
2. The idea is simple but effective.
1. The motivation is unclear. The paper claims that contrastive learning for single object-centric images "suffer inferior performance on the widespread images with multiple objects", but it doesn't provide enough evidence to support the claim. For example, ImageNet-1K and CIFAR are recognition problems with single objects, why do we need to stitch multiple together? I understand that, for example, even in ImageNet-1K a lot of times images do contain multiple objects. If this is the case, I sugg
- The paper presents an innovative approach to unsupervised multi-object representation learning, which is an increasingly important area in computer vision.
- The method's technique for multi-object image stitching through data augmentation, scaling, and tensor operations is efficient and leads to improved representation learning.
- While the method excels in multi-object representations, it may not have been extensively tested in scenarios with highly dynamic or cluttered objects.
- Self-supervised learning, particularly learning from multi-object images, is an important problem.
- The performance improvement is quite significant, both in image classification and object detection.
**Motivation is not new** The issue of semantic inconsistency in contrastive learning has been discussed in several prior works [1-4]. These works are not properly cited, which may lead readers to overestimate the contribution of this paper. The efforts of prior work and the contributions of this paper should be clarified in the second paragraph of the introduction.
[1] CASTing Your Model: Learning to Localize Improves Self-Supervised Representations. CVPR'21.
[2] Unsupervised Object-Level Re
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Face recognition and analysis · Advanced Neural Network Applications
Methods: Softmax · Attention Is All You Need
