Self-supervision through Random Segments with Autoregressive Coding (RandSAC)
Tianyu Hua, Yonglong Tian, Sucheng Ren, Michalis Raptis, Hang Zhao, Leonid Sigal

TL;DR
This paper introduces RandSAC, a self-supervised method for learning visual features with vision transformers. Image tokens are grouped into segments; tokens within a segment are predicted in parallel (as in BERT), while segments are predicted sequentially (as in GPT). The paper reports improved performance on CIFAR and ImageNet.
Contribution
The paper proposes RandSAC, a new self-supervised training strategy for vision transformers that uses hierarchical segment grouping and combined autoregressive and parallel prediction mechanisms.
Findings
RandSAC improves feature learning performance on CIFAR and ImageNet datasets.
Randomized segment serialization enhances training effectiveness.
Skip-connections in the decoder further boost accuracy.
Abstract
Inspired by the success of self-supervised autoregressive representation learning in natural language (GPT and its variants), and advances in recent visual architecture design with Vision Transformers (ViTs), in this paper, we explore the effect various design choices have on the success of applying such training strategies for visual feature learning. Specifically, we introduce a novel strategy that we call Random Segments with Autoregressive Coding (RandSAC). In RandSAC, we group patch representations (image tokens) into hierarchically arranged segments; within each segment, tokens are predicted in parallel, similar to BERT, while across segments, predictions are sequential, similar to GPT. We illustrate that randomized serialization of the segments significantly improves the performance and results in distribution over spatially-long (across-segments) and -short (within-segment)…
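The segment-wise prediction scheme described in the abstract can be illustrated with an attention mask. The following is a minimal sketch, not the authors' implementation: it assumes a single (non-hierarchical) level of square segments on a square patch grid, and the function names `random_segments` and `segment_causal_mask` are hypothetical. Tokens in the same segment share a prediction step and so cannot see each other (parallel, BERT-like), while each segment may attend only to segments that come earlier in a randomly drawn order (sequential, GPT-like).

```python
import numpy as np

def random_segments(grid_size, seg_size, rng):
    """Assign each patch token a prediction step under a random segment order.

    Assumption: segments are non-overlapping seg_size x seg_size blocks of a
    grid_size x grid_size patch grid (a flat, single-level simplification).
    """
    assert grid_size % seg_size == 0, "segments must tile the grid"
    per_side = grid_size // seg_size
    # Row and column of every token in raster order.
    rows, cols = np.divmod(np.arange(grid_size * grid_size), grid_size)
    # Segment id of every token.
    seg_id = (rows // seg_size) * per_side + (cols // seg_size)
    # Random serialization: order[k] is the segment predicted at step k.
    order = rng.permutation(per_side * per_side)
    step_of_segment = np.empty_like(order)
    step_of_segment[order] = np.arange(order.size)
    return step_of_segment[seg_id]

def segment_causal_mask(step):
    """mask[i, j] is True when token i may attend to token j.

    A token sees only tokens from strictly earlier segments, so tokens
    inside one segment are predicted in parallel from the same context.
    """
    return step[None, :] < step[:, None]

rng = np.random.default_rng(0)
step = random_segments(grid_size=4, seg_size=2, rng=rng)
mask = segment_causal_mask(step)
print(step.reshape(4, 4))   # prediction step of each patch
print(mask.astype(int))     # 16 x 16 attention mask
```

Drawing a fresh permutation for every training sample gives the randomized serialization the abstract refers to; a fixed raster order would correspond to replacing `rng.permutation(...)` with `np.arange(...)`.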
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Linear Warmup With Cosine Annealing · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Label Smoothing · Discriminative Fine-Tuning · Dropout
