MED-VT++: Unifying Multimodal Learning with a Multiscale Encoder-Decoder Video Transformer

Rezaul Karim; He Zhao; Richard P. Wildes; Mennatullah Siam

arXiv:2304.05930 · cs.CV · September 18, 2024 · 1 citation

Open Access

TL;DR

MED-VT++ introduces a unified multiscale transformer architecture for dense video prediction, leveraging multimodal inputs and a transductive learning scheme to improve accuracy and temporal consistency without optical flow.

Contribution

The paper proposes a novel end-to-end multiscale encoder-decoder transformer for video segmentation, incorporating multimodal processing and a transductive learning scheme for enhanced performance.

Findings

01 Outperforms state-of-the-art methods on multiple benchmarks

02 Effective without relying on optical flow

03 Demonstrates strong multimodal segmentation results

Abstract

In this paper, we present an end-to-end trainable unified multiscale encoder-decoder transformer that is focused on dense prediction tasks in video. The presented Multiscale Encoder-Decoder Video Transformer (MED-VT) uses multiscale representation throughout and employs an optional input beyond video (e.g., audio), when available, for multimodal processing (MED-VT++). Multiscale representation at both encoder and decoder yields three key benefits: (i) implicit extraction of spatiotemporal features at different levels of abstraction for capturing dynamics without reliance on input optical flow, (ii) temporal consistency at encoding and (iii) coarse-to-fine detection for high-level (e.g., object) semantics to guide precise localization at decoding. Moreover, we present a transductive learning scheme through many-to-many label propagation to provide temporally consistent video predictions.…
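The many-to-many label propagation idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `propagate_labels`, the cosine-similarity affinity, and the `temperature` parameter are all assumptions chosen for illustration. The sketch only shows the general transductive idea that every position in every frame receives label evidence from every other position in the clip, weighted by feature similarity.

```python
import numpy as np

def propagate_labels(features, labels, temperature=0.1):
    """Hypothetical many-to-many label propagation over a clip.

    features: (T, N, D) array, one D-dim feature per position per frame.
    labels:   (T, N, C) array of per-position class scores or probabilities.
    Returns a (T, N, C) array in which each position's labels are a
    similarity-weighted average of the labels at all T*N positions.
    """
    T, N, D = features.shape
    C = labels.shape[-1]

    # Flatten time and space so every position can attend to every other one.
    f = features.reshape(T * N, D)
    f = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-8)

    # Assumed affinity: softmax over cosine similarity, scaled by temperature.
    sim = (f @ f.T) / temperature
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(sim)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Each row of `weights` sums to 1, so probabilities stay normalized.
    propagated = weights @ labels.reshape(T * N, C)
    return propagated.reshape(T, N, C)
```

Because the affinity matrix is row-stochastic, feeding in per-position probabilities yields per-position probabilities; positions with similar features across frames end up with similar predictions, which is the temporal-consistency effect the abstract attributes to the scheme.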

Taxonomy

Topics: Advanced Data Compression Techniques · CCD and CMOS Imaging Sensors · Image Processing Techniques and Applications

Methods: Attention Is All You Need · Dropout · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Softmax · Linear Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections