RGB no more: Minimally-decoded JPEG Vision Transformers

Jeongsoo Park; Justin Johnson

arXiv:2211.16421 · cs.CV · June 16, 2023

TL;DR

This paper introduces a method for training Vision Transformers directly on JPEG-encoded features, bypassing most of the decoding overhead, and reports up to 39.2% faster training and up to 17.9% faster inference with no accuracy loss compared to RGB models.

Contribution

It shows that Vision Transformers can be trained directly on JPEG features without architectural modifications, and introduces data augmentation techniques for this setting.

Findings

1. Up to 39.2% faster training
2. Up to 17.9% faster inference
3. No accuracy loss compared to RGB models

Abstract

Most neural networks for computer vision are designed to infer using RGB images. However, these RGB images are commonly encoded in JPEG before saving to disk; decoding them imposes an unavoidable overhead for RGB networks. Instead, our work focuses on training Vision Transformers (ViT) directly from the encoded features of JPEG. This way, we can avoid most of the decoding overhead, accelerating data load. Existing works have studied this aspect, but they focus on CNNs. Due to how these encoded features are structured, CNNs require heavy modification to their architecture to accept such data. Here, we show that this is not the case for ViTs. In addition, we tackle data augmentation directly on these encoded features, which, to our knowledge, has not been explored in depth for training in this setting. With these two improvements -- ViT and data augmentation -- we show that our ViT-Ti model…
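
To make "encoded features" concrete, here is a minimal sketch of how block-structured JPEG data could line up with ViT-style tokens. It assumes the features are the 8x8 block DCT coefficients that the JPEG standard stores per channel; the abstract shown here does not spell out the layout, and this is not the authors' implementation. The function names (`dct_1d`, `dct_2d`, `blocks_to_tokens`) and the `blocks_per_side` parameter are hypothetical. The DCT is computed from pixels only to stand in for coefficients that a real pipeline would read from the file, and level shift, quantization, and chroma subsampling are all ignored.

```python
import math

BLOCK = 8  # JPEG applies its DCT to 8x8 blocks of each channel


def dct_1d(vec):
    """Orthonormal DCT-II of a sequence (a stand-in for stored JPEG coefficients)."""
    n = len(vec)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(
            v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i, v in enumerate(vec)
        )
        out.append(scale * total)
    return out


def dct_2d(block):
    """Separable 2D DCT: transform every row, then every column."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def blocks_to_tokens(channel, blocks_per_side=2):
    """Hypothetical helper: flatten each group of neighbouring 8x8 DCT blocks
    into one token, the way a ViT flattens a pixel patch."""
    h, w = len(channel), len(channel[0])
    patch = BLOCK * blocks_per_side
    assert h % patch == 0 and w % patch == 0, "sketch assumes exact tiling"
    tokens = []
    for py in range(0, h, patch):
        for px in range(0, w, patch):
            token = []
            for by in range(py, py + patch, BLOCK):
                for bx in range(px, px + patch, BLOCK):
                    block = [row[bx:bx + BLOCK] for row in channel[by:by + BLOCK]]
                    coeffs = dct_2d(block)
                    token.extend(c for r in coeffs for c in r)
            tokens.append(token)
    return tokens


# Usage: a synthetic 32x32 luma channel split into 16x16 patches.
luma = [[float((x + y) % 256) for x in range(32)] for y in range(32)]
tokens = blocks_to_tokens(luma)
print(len(tokens), len(tokens[0]))  # 4 256
```

Each token is a flat vector, so a standard linear patch embedding could consume it unchanged. This is one plausible reading of why a ViT would need less architectural surgery than a CNN, whose convolutions assume spatially adjacent pixels rather than per-block frequency coefficients.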

Code & Models

Repositories

jeongsoop/rgb-no-more (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications