Object-Region Video Transformers

Roei Herzig; Elad Ben-Avraham; Karttikeya Mangalam; Amir Bar; Gal Chechik; Anna Rohrbach; Trevor Darrell; Amir Globerson

arXiv:2110.06915·cs.CV·June 13, 2022




TL;DR

Object-Region Video Transformers (ORViT) introduce an object-centric approach to video understanding by integrating object representations into transformer layers, improving performance across multiple action recognition and detection tasks.

Contribution

The paper proposes ORViT, a novel object-centric extension to video transformers that incorporates object appearance and dynamics early in the network, enhancing spatio-temporal representations.

Findings

01. Significant performance gains on multiple datasets and tasks.

02. Effective integration of object regions and dynamics into transformer architecture.

03. Improved action recognition and detection accuracy.

Abstract

Recently, video transformers have shown great success in video understanding, exceeding CNN performance; yet existing video transformer models do not explicitly model objects, although objects can be essential for recognizing actions. In this work, we present Object-Region Video Transformers (ORViT), an object-centric approach that extends video transformer layers with a block that directly incorporates object representations. The key idea is to fuse object-centric representations starting from early layers and propagate them into the transformer-layers, thus affecting the spatio-temporal representations throughout the network. Our ORViT block consists of two object-level streams: appearance and dynamics. In the appearance stream, an "Object-Region Attention" module applies self-attention over the patches and object regions. In this way, visual object regions interact with…
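As a rough illustration of what "self-attention over the patches and object regions" could look like, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes a single frame, a single attention head, and object tokens obtained by averaging the patch features inside each bounding box; the function names `pool_region` and `joint_self_attention`, the box format, and all shapes are hypothetical choices made only for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pool_region(patch_grid, box):
    """Average the patch features inside box = (row0, col0, row1, col1), end-exclusive.

    Assumption: mean pooling stands in for whatever region pooling the paper uses.
    """
    r0, c0, r1, c1 = box
    region = patch_grid[r0:r1, c0:c1]
    return region.reshape(-1, patch_grid.shape[-1]).mean(axis=0)

def joint_self_attention(patch_grid, boxes, w_q, w_k, w_v):
    """Single-head self-attention over patch tokens plus object-region tokens."""
    h, w, d = patch_grid.shape
    patches = patch_grid.reshape(h * w, d)                           # (N, d)
    objects = np.stack([pool_region(patch_grid, b) for b in boxes])  # (O, d)
    tokens = np.concatenate([patches, objects], axis=0)              # (N + O, d)

    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)          # (N + O, N + O)
    out = attn @ v

    # Split the result back into updated patch tokens and object tokens.
    return out[: h * w].reshape(h, w, d), out[h * w:], attn

rng = np.random.default_rng(0)
grid = rng.normal(size=(4, 4, 8))       # 4x4 patch grid, 8-dim features (toy sizes)
boxes = [(0, 0, 2, 2), (1, 1, 4, 4)]    # two hypothetical object boxes in patch units
w_q, w_k, w_v = (rng.normal(size=(8, 8)) * 0.1 for _ in range(3))

new_patches, new_objects, attn = joint_self_attention(grid, boxes, w_q, w_k, w_v)
print(new_patches.shape, new_objects.shape, attn.shape)
```

Because patch and object tokens share one attention matrix, every patch can attend to every object region and vice versa. Extending the sketch to video would mean adding a time axis and per-frame boxes, which is left out here.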


Code & Models

Repositories

eladb3/orvit
pytorch


Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Neural Network Applications