TL;DR
This paper demonstrates that the order of image patches significantly impacts vision model performance and introduces REOrder, a method that learns optimal patch orderings to enhance accuracy on vision benchmarks.
Contribution
The paper proposes REOrder, a two-stage framework that learns task-specific patch orderings. It first derives an information-theoretic prior from the compressibility of candidate patch sequences, then learns a Plackett-Luce policy over permutations with REINFORCE.
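The information-theoretic stage can be illustrated with a minimal sketch: flatten a patch grid under different orderings and compare how well each resulting byte stream compresses. The compressor (`zlib`), the toy striped patches, and the helper names `row_major`, `column_major`, and `compressed_size` are assumptions made for illustration; the summary does not say which compressor or scoring rule the authors use.

```python
import zlib

def row_major(h, w):
    """Patch indices in raster-scan order (row by row)."""
    return [r * w + c for r in range(h) for c in range(w)]

def column_major(h, w):
    """Patch indices visited column by column."""
    return [r * w + c for c in range(w) for r in range(h)]

def compressed_size(patches, order):
    """Bytes needed by zlib for the patches concatenated in the given order."""
    stream = b"".join(patches[i] for i in order)
    return len(zlib.compress(stream, 9))

# Toy 4x4 grid of 64-byte patches whose content depends only on the column
# (vertical stripes). This is hypothetical data, not the paper's.
h, w = 4, 4
patches = [bytes([(i % w) * 16]) * 64 for i in range(h * w)]

for name, order in [("row-major", row_major(h, w)),
                    ("column-major", column_major(h, w))]:
    print(name, compressed_size(patches, order))
```

A lower compressed size suggests the ordering places more predictable content next to each other. How such a score is turned into a prior over orderings is left open here.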
Findings
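REOrder improves ImageNet-1K top-1 accuracy by up to 3.01% over row-major ordering.
REOrder improves Functional Map of the World top-1 accuracy by 13.35% over row-major ordering.
Patch order significantly influences performance in long-sequence transformers whose attention approximations break permutation equivariance.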
Abstract
Sequence models such as transformers require inputs to be represented as one-dimensional sequences. In vision, this typically involves flattening images using a fixed row-major (raster-scan) order. While full self-attention is permutation-equivariant, modern long-sequence transformers increasingly rely on architectural approximations that break this invariance and introduce sensitivity to patch ordering. We show that patch order significantly affects model performance in such settings, with simple alternatives like column-major or Hilbert curves yielding notable accuracy shifts. Motivated by this, we propose REOrder, a two-stage framework for discovering task-optimal patch orderings. First, we derive an information-theoretic prior by evaluating the compressibility of various patch sequences. Then, we learn a policy over permutations by optimizing a Plackett-Luce policy using REINFORCE.…
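The policy-learning stage named in the abstract can be sketched as follows, assuming one learnable logit per patch. A Plackett-Luce distribution samples a permutation by repeatedly drawing one of the remaining patches with softmax probability, and REINFORCE moves the logits along the gradient of the sampled permutation's log-probability, scaled by a reward. The reward function, learning rate, baseline, and all function names below are hypothetical; the abstract does not specify them.

```python
import math
import random

def sample_permutation(logits, rng):
    """Draw a permutation from a Plackett-Luce distribution over the logits."""
    remaining = list(range(len(logits)))
    perm = []
    while remaining:
        m = max(logits[i] for i in remaining)  # subtract max for stability
        weights = [math.exp(logits[i] - m) for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        perm.append(pick)
        remaining.remove(pick)
    return perm

def log_prob_and_grad(logits, perm):
    """Log-probability of perm and its gradient with respect to the logits."""
    grad = [0.0] * len(logits)
    logp = 0.0
    for k in range(len(perm)):
        rest = perm[k:]  # items not yet placed before step k
        m = max(logits[i] for i in rest)
        exps = {i: math.exp(logits[i] - m) for i in rest}
        total = sum(exps.values())
        logp += logits[perm[k]] - m - math.log(total)
        grad[perm[k]] += 1.0  # chosen item
        for i in rest:
            grad[i] -= exps[i] / total  # softmax over remaining items
    return logp, grad

def reinforce_step(logits, reward_fn, rng, lr=0.1, baseline=0.0):
    """One REINFORCE update: sample, score, and ascend the reward-weighted gradient."""
    perm = sample_permutation(logits, rng)
    reward = reward_fn(perm)
    _, grad = log_prob_and_grad(logits, perm)
    new_logits = [z + lr * (reward - baseline) * g for z, g in zip(logits, grad)]
    return new_logits, perm, reward

# Toy run with a made-up reward: 1 if patch 2 comes first, else 0.
rng = random.Random(0)
logits = [0.0, 0.0, 0.0, 0.0]
for _ in range(200):
    logits, _, _ = reinforce_step(logits, lambda p: float(p[0] == 2), rng)
print(logits)
```

In a real training loop the reward would presumably come from the downstream vision model rather than a hand-written rule, and a running-average baseline is a common way to reduce the variance of the update.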
Taxonomy
Methods: REINFORCE
