MED-VT++: Unifying Multimodal Learning with a Multiscale Encoder-Decoder Video Transformer
Rezaul Karim, He Zhao, Richard P. Wildes, Mennatullah Siam

TL;DR
MED-VT++ introduces a unified multiscale transformer architecture for dense video prediction, leveraging multimodal inputs and a transductive learning scheme to improve accuracy and temporal consistency without optical flow.
Contribution
The paper proposes a novel end-to-end multiscale encoder-decoder transformer for video segmentation, incorporating multimodal processing and a transductive learning scheme for enhanced performance.
Findings
Outperforms state-of-the-art on multiple benchmarks
Effective without optical flow reliance
Demonstrates strong multimodal segmentation results
Abstract
In this paper, we present an end-to-end trainable unified multiscale encoder-decoder transformer that is focused on dense prediction tasks in video. The presented Multiscale Encoder-Decoder Video Transformer (MED-VT) uses multiscale representation throughout and employs an optional input beyond video (e.g., audio), when available, for multimodal processing (MED-VT++). Multiscale representation at both encoder and decoder yields three key benefits: (i) implicit extraction of spatiotemporal features at different levels of abstraction for capturing dynamics without reliance on input optical flow, (ii) temporal consistency at encoding and (iii) coarse-to-fine detection for high-level (e.g., object) semantics to guide precise localization at decoding. Moreover, we present a transductive learning scheme through many-to-many label propagation to provide temporally consistent video predictions.…
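The many-to-many label propagation mentioned in the abstract can be pictured with a small sketch. This is not the authors' implementation. It is an illustration under stated assumptions: the function name `propagate_labels`, the dot-product affinity, the softmax normalization, and the `temperature` parameter are all hypothetical choices made here. The idea shown is only that every location in every frame of a clip refines its class scores using every other location in the clip, which encourages temporally consistent predictions.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def propagate_labels(features, scores, temperature=0.1):
    """Refine per-location class scores across a whole clip.

    features: list of T frames, each a list of per-location feature vectors.
    scores:   list of T frames, each a list of per-location class-score vectors.
    Returns refined scores with the same nested shape as `scores`.

    Assumption: affinity is a temperature-scaled dot product. The paper's
    actual propagation scheme may differ.
    """
    # Flatten the clip so every location in every frame can attend to
    # every location in every frame (many-to-many).
    flat_feats = [f for frame in features for f in frame]
    flat_scores = [s for frame in scores for s in frame]
    num_classes = len(flat_scores[0])

    refined = []
    for query in flat_feats:
        sims = [
            sum(a * b for a, b in zip(query, key)) / temperature
            for key in flat_feats
        ]
        weights = softmax(sims)
        # Each refined score is an affinity-weighted average of all scores.
        refined.append([
            sum(w * s[c] for w, s in zip(weights, flat_scores))
            for c in range(num_classes)
        ])

    # Restore the per-frame layout.
    out, start = [], 0
    for frame in features:
        out.append(refined[start:start + len(frame)])
        start += len(frame)
    return out


# Two frames, two locations each, two classes (toy values).
features = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
scores = [[[1.0, 0.0], [0.0, 1.0]], [[0.6, 0.4], [0.4, 0.6]]]
refined = propagate_labels(features, scores)
```

In this toy clip, locations with matching features in different frames pull their scores toward each other, so the confident prediction in the first frame sharpens the uncertain one in the second.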
Taxonomy
Topics: Advanced Data Compression Techniques · CCD and CMOS Imaging Sensors · Image Processing Techniques and Applications
Methods: Attention Is All You Need · Dropout · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Softmax · Linear Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections
