Object-Region Video Transformers
Roei Herzig, Elad Ben-Avraham, Karttikeya Mangalam, Amir Bar, Gal Chechik, Anna Rohrbach, Trevor Darrell, Amir Globerson

TL;DR
Object-Region Video Transformers (ORViT) introduce an object-centric approach to video understanding by integrating object representations into transformer layers, improving performance across multiple action recognition and detection tasks.
Contribution
The paper proposes ORViT, a novel object-centric extension to video transformers that incorporates object appearance and dynamics early in the network, enhancing spatio-temporal representations.
Findings
Significant performance gains on multiple datasets and tasks.
Effective integration of object regions and dynamics into transformer architecture.
Improved action recognition and detection accuracy.
Abstract
Recently, video transformers have shown great success in video understanding, exceeding CNN performance; yet existing video transformer models do not explicitly model objects, although objects can be essential for recognizing actions. In this work, we present Object-Region Video Transformers (ORViT), an object-centric approach that extends video transformer layers with a block that directly incorporates object representations. The key idea is to fuse object-centric representations starting from early layers and propagate them into the transformer layers, thus affecting the spatio-temporal representations throughout the network. Our ORViT block consists of two object-level streams: appearance and dynamics. In the appearance stream, an "Object-Region Attention" module applies self-attention over the patches and object regions. In this way, visual object regions interact with…
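The appearance stream described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it handles a single frame and a single attention head, it builds each object-region token by averaging the patch tokens inside a bounding box (an assumed stand-in for whatever region pooling the paper uses), and the names `pool_object_regions`, `object_region_attention`, `w_q`, `w_k`, and `w_v` are hypothetical. It shows only the core idea of running self-attention over patch tokens and object-region tokens together.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pool_object_regions(patches, boxes):
    """Build one token per object by averaging the patch tokens in its box.

    Assumption: mean pooling is used here purely for illustration.
    patches: (H, W, D) grid of patch tokens for one frame.
    boxes:   (O, 4) integer rows [row0, col0, row1, col1] in patch units,
             end-exclusive.
    Returns an (O, D) array of object-region tokens.
    """
    dim = patches.shape[-1]
    return np.stack([
        patches[r0:r1, c0:c1].reshape(-1, dim).mean(axis=0)
        for r0, c0, r1, c1 in boxes
    ])


def object_region_attention(patches, boxes, w_q, w_k, w_v):
    """Single-head self-attention over patch tokens plus object-region tokens.

    Returns the updated patch grid (H, W, D) and the full attention matrix
    of shape (H*W + O, H*W + O).
    """
    height, width, dim = patches.shape
    patch_tokens = patches.reshape(height * width, dim)
    object_tokens = pool_object_regions(patches, boxes)

    # Concatenate so that patches and object regions attend to each other.
    tokens = np.concatenate([patch_tokens, object_tokens], axis=0)

    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(dim), axis=-1)
    out = attn @ v

    # Keep only the patch positions so the result can feed the next layer.
    return out[: height * width].reshape(height, width, dim), attn


rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 4, 8))           # 4x4 patch grid, 8-dim tokens
boxes = np.array([[0, 0, 2, 2], [1, 1, 4, 4]])  # two hypothetical object boxes
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
updated, attn = object_region_attention(patches, boxes, w_q, w_k, w_v)
print(updated.shape, attn.shape)
```

Returning only the patch positions keeps the output the same shape as the input grid, so a block like this could in principle be dropped between ordinary transformer layers; a video model would additionally extend the token set across frames.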
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Neural Network Applications
