How PARTs assemble into wholes: Learning the relative composition of images

Melika Ayoughi; Samira Abnar; Chen Huang; Chris Sandino; Sayeri Lala; Eeshan Gunesh Dhekane; Dan Busbridge; Shuangfei Zhai; Vimal Thilak; Josh Susskind; Pascal Mettes; Paul Groth; Hanlin Goh

arXiv:2506.03682 · cs.CV · December 16, 2025

TL;DR

This paper introduces PART, a self-supervised learning method that models continuous relative positions of image parts, improving spatial understanding and robustness over grid-based approaches across various tasks.

Contribution

PART leverages continuous relative transformations between off-grid patches, enabling more flexible and accurate modeling of object composition in images.
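To make "continuous relative transformations between off-grid patches" concrete, here is a minimal sketch of how such targets could be built. It is not the authors' implementation: the box parameterization (center plus size), the choice of translation normalized by the reference patch size together with a log size ratio, and the helper names `sample_off_grid_patches` and `relative_targets` are all assumptions made for illustration.

```python
import numpy as np

def sample_off_grid_patches(num_patches, image_size, min_size, max_size, rng):
    """Sample patch boxes (cx, cy, w, h) at continuous positions, not on a grid.

    Hypothetical helper: the sampling scheme is an assumption, not the paper's.
    """
    sizes = rng.uniform(min_size, max_size, size=(num_patches, 2))
    half = sizes / 2.0
    # Draw centers so that every box lies fully inside the image.
    centers = rng.uniform(half, image_size - half)
    return np.concatenate([centers, sizes], axis=1)

def relative_targets(boxes):
    """Pairwise relative transformation targets, shape (N, N, 4).

    Entry [i, j] describes patch j relative to patch i:
    (dx, dy) in units of patch i's size, then log(w_j / w_i), log(h_j / h_i).
    This parameterization is an illustrative assumption.
    """
    centers, sizes = boxes[:, :2], boxes[:, 2:]
    delta = (centers[None, :, :] - centers[:, None, :]) / sizes[:, None, :]
    log_scale = np.log(sizes[None, :, :] / sizes[:, None, :])
    return np.concatenate([delta, log_scale], axis=-1)

rng = np.random.default_rng(0)
boxes = sample_off_grid_patches(4, image_size=224.0, min_size=16.0, max_size=64.0, rng=rng)
targets = relative_targets(boxes)
print(targets.shape)  # (4, 4, 4)
```

Because every target is a difference or ratio between two patches, adding the same offset to all patch centers leaves `targets` unchanged, which is one way a relative formulation can avoid depending on absolute image coordinates.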

Findings

1. Outperforms grid-based methods like MAE and DropPos in object detection and time series prediction.
2. Maintains competitive performance on global classification tasks.
3. Applicable to diverse data types including EEG signals, medical imaging, video, and audio.

Abstract

The composition of objects and their parts, along with object-object positional relationships, provides a rich source of information for representation learning. Hence, spatial-aware pretext tasks have been actively explored in self-supervised learning. Existing works commonly start from a grid structure, where the goal of the pretext task involves predicting the absolute position index of patches within a fixed grid. However, grid-based approaches fall short of capturing the fluid and continuous nature of real-world object compositions. We introduce PART, a self-supervised learning approach that leverages continuous relative transformations between off-grid patches to overcome these limitations. By modeling how parts relate to each other in a continuous space, PART learns the relative composition of images: an off-grid structural relative positioning that is less tied to absolute…
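For contrast, the grid-based pretext target the abstract describes (an absolute position index within a fixed grid) can be sketched as follows. This is a generic illustration rather than the code of any specific prior method; the row-major ordering, the 224-pixel image, the 16-pixel patch, and the function name `grid_position_index` are assumptions.

```python
def grid_position_index(x, y, image_size=224, patch_size=16):
    """Absolute index of the grid cell whose top-left corner is (x, y).

    Assumes row-major ordering over a fixed, non-overlapping grid.
    Patches that are not aligned to the grid have no valid label.
    """
    if x % patch_size or y % patch_size:
        raise ValueError("patch is not aligned to the grid")
    cols = image_size // patch_size
    return (y // patch_size) * cols + (x // patch_size)

print(grid_position_index(32, 16))  # 16: row 1, column 2 of a 14-column grid

try:
    grid_position_index(37, 16)
except ValueError as err:
    print(err)  # an off-grid patch cannot be assigned a grid index
```

The label set here is discrete and fixed by the grid, so a patch sampled at an arbitrary continuous location falls outside what the task can express.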

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis