TBN-ViT: Temporal Bilateral Network with Vision Transformer for Video Scene Parsing

Bo Yan; Leilei Cao; Hongbin Wang

arXiv:2112.01033·cs.CV·December 3, 2021·1 cites

Open Access

TL;DR

This paper introduces TBN-ViT, a novel video scene parsing model combining convolutional spatial features, vision transformer context, and inter-frame temporal information to improve accuracy on the VSPW dataset.

Contribution

The paper proposes a Temporal Bilateral Network with Vision Transformer that effectively integrates spatial, contextual, and temporal features for enhanced video scene parsing.

Findings

01. Achieves 49.85% mIoU on the VSPW2021 test dataset

02. Combines convolutional and transformer-based features effectively

03. Outperforms previous methods on the VSPW benchmark

Abstract

Video scene parsing in the wild, across diverse scenarios, is a challenging task of great significance, especially given the rapid development of autonomous driving technology. The Video Scene Parsing in the Wild (VSPW) dataset contains well-trimmed, long-temporal, densely annotated, high-resolution clips. Based on VSPW, we design a Temporal Bilateral Network with Vision Transformer. We first design a spatial path with convolutions to generate low-level features that preserve spatial information. Meanwhile, a context path with a vision transformer is employed to obtain sufficient context information. Furthermore, a temporal context module is designed to harness inter-frame contextual information. Finally, the proposed method achieves a mean intersection over union (mIoU) of 49.85% on the VSPW2021 Challenge test dataset.
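For readers unfamiliar with the reported metric, the following is a minimal sketch of how mean intersection over union can be computed from flattened per-pixel class labels. It is not the authors' evaluation code. The function name `mean_iou`, the choice to skip classes absent from both prediction and ground truth, and the absence of any ignore label are assumptions made for illustration; the VSPW2021 Challenge protocol may differ in these details.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU over classes that occur in the prediction or the ground truth.

    Hypothetical helper for illustration only; labels are assumed to be
    integers in [0, num_classes) with no ignore label.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes  # pixels where pred == target == c
    pred_count = [0] * num_classes    # pixels predicted as class c
    target_count = [0] * num_classes  # pixels labelled as class c

    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # assumption: skip classes absent from both inputs
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else 0.0


# Toy example with six pixels and three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.2%}")
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. In practice the counts are usually accumulated over every frame of the evaluation set before the per-class ratios are taken.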

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Absolute Position Encodings · Residual Connection · Softmax · Adam · Dropout