Joint learning of images and videos with a single Vision Transformer
Shuki Shimizu, Toru Tamaki

TL;DR
This paper introduces a unified Vision Transformer model that jointly learns from images and videos, leveraging temporal aggregation for videos, and demonstrates its effectiveness across multiple datasets.
Contribution
The paper presents a novel single-model approach for joint image and video learning using Vision Transformer with temporal aggregation.
Findings
Effective on two image datasets and two action recognition datasets
Achieves performance competitive with separately trained image and video models
Demonstrates the feasibility of unified image-video learning
Abstract
In this study, we propose a method for jointly learning images and videos with a single model. Images and videos are usually learned by separate models. Our model, IV-ViT, is a Vision Transformer that takes a batch of images as input, and also a set of video frames, to which temporal aggregation by late fusion is applied. We present experimental results on two image datasets and two action recognition datasets.
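The idea in the abstract can be illustrated with a minimal sketch: one shared per-frame model scores images directly, while a video is scored frame by frame and the per-frame scores are fused afterwards. This is not the authors' implementation. `toy_frame_model` is a hypothetical stand-in for the shared Vision Transformer, the function names are invented for illustration, and averaging over frames is an assumed form of late fusion (the abstract does not say which aggregation is used).

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]   # toy stand-in for one flattened image or video frame
Logits = List[float]      # one score per class


def toy_frame_model(frame: Frame, num_classes: int = 3) -> Logits:
    """Hypothetical stand-in for the shared ViT: maps one frame to class scores."""
    total = sum(frame)
    return [total * (k + 1) for k in range(num_classes)]


def forward_images(model: Callable[[Frame], Logits],
                   images: Sequence[Frame]) -> List[Logits]:
    """Image path: every image in the batch goes through the shared model once."""
    return [model(image) for image in images]


def forward_videos(model: Callable[[Frame], Logits],
                   videos: Sequence[Sequence[Frame]]) -> List[Logits]:
    """Video path: the same model scores each frame independently, then the
    per-frame scores are fused over time (assumption: a simple mean)."""
    fused_batch = []
    for frames in videos:
        per_frame = [model(frame) for frame in frames]
        num_frames = len(per_frame)
        # Late fusion: aggregate after the model, one mean per class.
        fused = [sum(scores) / num_frames for scores in zip(*per_frame)]
        fused_batch.append(fused)
    return fused_batch


if __name__ == "__main__":
    images = [[1.0, 2.0], [0.5, 0.5]]          # batch of two toy images
    videos = [[[1.0, 1.0], [3.0, 3.0]]]        # batch of one two-frame toy video
    print(forward_images(toy_frame_model, images))
    print(forward_videos(toy_frame_model, videos))
```

Because both paths call the same `model`, image batches and video batches produce outputs of the same shape (one score vector per sample), which is what lets a single network be trained on both kinds of data.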
Taxonomy
Topics: Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning · Advanced Vision and Imaging
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Label Smoothing · Layer Normalization · Softmax · Dense Connections
