Joint learning of images and videos with a single Vision Transformer

Shuki Shimizu; Toru Tamaki

arXiv:2308.10533 · cs.CV · August 22, 2023



TL;DR

This paper introduces a unified Vision Transformer model, IV-ViT, that learns jointly from images and videos. Videos are handled by temporal aggregation of frames with late fusion, and results are reported on two image datasets and two action recognition datasets.

Contribution

The paper presents a novel single-model approach for joint image and video learning using Vision Transformer with temporal aggregation.

Findings

01. Effective on multiple image and action recognition datasets

02. Achieves performance competitive with separate models

03. Demonstrates the feasibility of unified image-video learning

Abstract

In this study, we propose a method for jointly learning images and videos with a single model. Images and videos are usually handled by separate models. The proposed model, IV-ViT, is a Vision Transformer that takes a batch of images as input, and also takes a set of video frames, which it combines through temporal aggregation by late fusion. We present experimental results on two image datasets and two action recognition datasets.
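The late-fusion idea in the abstract can be illustrated with a minimal sketch. It assumes that late fusion means running one shared per-frame model on every frame and averaging the per-frame class scores; the abstract does not specify the aggregation operator, so the averaging is an assumption. The names `toy_frame_model`, `predict_images`, and `predict_video` are hypothetical, and the toy model is only a stand-in for the Vision Transformer, not the authors' implementation.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]   # stand-in for one image or one video frame
Logits = List[float]      # one score per class


def toy_frame_model(frame: Frame) -> Logits:
    """Hypothetical stand-in for the shared ViT: maps one frame to 2 class scores."""
    total = sum(frame)
    return [total, -total]


def predict_images(model: Callable[[Frame], Logits],
                   images: Sequence[Frame]) -> List[Logits]:
    """Image path: each image in the batch is scored independently."""
    return [model(image) for image in images]


def predict_video(model: Callable[[Frame], Logits],
                  frames: Sequence[Frame]) -> Logits:
    """Video path: score every frame with the same model, then fuse late.

    Assumption: fusion is a plain mean of per-frame scores over time.
    """
    per_frame = [model(frame) for frame in frames]
    num_frames = len(per_frame)
    # zip(*per_frame) groups the scores of each class across all frames.
    return [sum(class_scores) / num_frames for class_scores in zip(*per_frame)]


# Two "images" and one three-frame "video", all handled by the same model.
image_batch = [[1.0, 2.0], [0.5, 0.5]]
video_clip = [[1.0, 1.0], [3.0, 1.0], [2.0, 0.0]]

image_logits = predict_images(toy_frame_model, image_batch)
video_logits = predict_video(toy_frame_model, video_clip)
```

The point of the sketch is that both input types pass through the same model weights; only the video path adds a reduction over the time axis after per-frame inference.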


Taxonomy

Topics: Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning · Advanced Vision and Imaging

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Label Smoothing · Layer Normalization · Softmax · Dense Connections