Omnivore: A Single Model for Many Visual Modalities

Rohit Girdhar; Mannat Singh; Nikhila Ravi; Laurens van der Maaten; Armand Joulin; Ishan Misra

arXiv:2201.08377 · cs.CV · April 1, 2022

TL;DR

This paper introduces Omnivore, a transformer-based model that classifies images, videos, and single-view 3D data with a single set of parameters and performs on par with or better than modality-specific models of the same size.

Contribution

The paper presents a unified model that handles images, videos, and single-view 3D data with shared parameters. It is trained jointly on standard classification datasets from each modality, and its shared representation enables cross-modal recognition.

Findings

1. Achieves 86.0% on ImageNet
2. Obtains 84.1% on Kinetics
3. Reaches 67.1% on SUN RGB-D

Abstract

Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data. Instead, in this paper, we propose a single model which excels at classifying images, videos, and single-view 3D data using exactly the same model parameters. Our 'Omnivore' model leverages the flexibility of transformer-based architectures and is trained jointly on classification tasks from different modalities. Omnivore is simple to train, uses off-the-shelf standard datasets, and performs at-par or better than modality-specific models of the same size. A single Omnivore model obtains 86.0% on ImageNet, 84.1% on Kinetics, and 67.1% on SUN RGB-D. After finetuning, our models outperform prior work on a variety of vision tasks and generalize across modalities. Omnivore's shared visual representation naturally enables cross-modal recognition…
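The abstract does not spell out how one set of parameters consumes three input types. One plausible reading, sketched here, is that every input is viewed as a clip with a time axis (an image or single-view RGB-D frame being a one-frame clip) and cut into spatio-temporal patches, so a single transformer sees the same kind of token sequence regardless of modality. The patch size `(2, 4, 4)`, the 4-channel RGB-D layout, and the helper names `token_grid` and `describe` are illustrative assumptions, not details taken from the abstract.

```python
import math
from typing import Dict, Tuple

# Assumed (time, height, width) patch size; chosen for illustration only.
PATCH: Tuple[int, int, int] = (2, 4, 4)

# Assumed channel counts: RGB for images and videos, RGB + depth for RGB-D.
CHANNELS = {"image": 3, "video": 3, "rgbd": 4}


def token_grid(frames: int, height: int, width: int,
               patch: Tuple[int, int, int] = PATCH) -> Tuple[int, int, int]:
    """Patch counts along (time, height, width).

    Partial patches are rounded up, which corresponds to padding the input
    up to a multiple of the patch size along each axis.
    """
    pt, ph, pw = patch
    return (math.ceil(frames / pt), math.ceil(height / ph), math.ceil(width / pw))


def describe(modality: str, frames: int, height: int, width: int) -> Dict[str, object]:
    """Summarise how one input would look as a clip and as a token sequence."""
    if modality not in CHANNELS:
        raise ValueError(f"unknown modality: {modality}")
    if modality != "video" and frames != 1:
        raise ValueError("images and RGB-D inputs are treated as one-frame clips")
    grid = token_grid(frames, height, width)
    return {
        "modality": modality,
        "tensor_shape": (frames, height, width, CHANNELS[modality]),
        "token_grid": grid,
        "num_tokens": grid[0] * grid[1] * grid[2],
    }


if __name__ == "__main__":
    for args in [("image", 1, 224, 224), ("video", 32, 224, 224), ("rgbd", 1, 224, 224)]:
        print(describe(*args))
```

Under these assumptions, an image and an RGB-D frame of the same resolution yield the same number of tokens, and a video simply yields more of them, which is what lets a shared backbone process all three.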

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Visual Attention and Saliency Detection