Panoramic Vision Transformer for Saliency Detection in 360° Videos

Heeseung Yun; Sehun Lee; Gunhee Kim

arXiv:2209.08956 · cs.CV · September 20, 2022 · 1 citation

TL;DR

The paper introduces PAVER, a panoramic vision transformer framework for saliency detection in 360° videos. It handles the distortion and discontinuity of projected 360° frames and outperforms existing models on the Wild360 benchmark without supervision.

Contribution

PAVER builds its encoder from a Vision Transformer with deformable convolution. This lets models pretrained on normal videos plug in without additional modules or finetuning, and it requires geometric approximation only once, unlike previous deep CNN-based approaches.

Findings

01 Outperforms state-of-the-art models on the Wild360 benchmark

02 Operates without supervision or auxiliary data

03 Improves omnidirectional video quality assessment results

Abstract

360° video saliency detection is one of the challenging benchmarks for 360° video understanding since non-negligible distortion and discontinuity occur in the projection of any format of 360° videos, and capture-worthy viewpoint in the omnidirectional sphere is ambiguous by nature. We present a new framework named Panoramic Vision Transformer (PAVER). We design the encoder using Vision Transformer with deformable convolution, which enables us not only to plug pretrained models from normal videos into our architecture without additional modules or finetuning but also to perform geometric approximation only once, unlike previous deep CNN-based approaches. Thanks to its powerful encoder, PAVER can learn the saliency from three simple relative relations among local patch features, outperforming state-of-the-art models for the Wild360 benchmark by large margins without…
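The projection distortion mentioned in the abstract can be made concrete with a small sketch. This is not PAVER's implementation: it is the standard inverse gnomonic projection, and it assumes an equirectangular frame, which the abstract does not name. The function name `tangent_patch_to_equirect` and all of its parameters are hypothetical and chosen for illustration only. The sketch maps a regular grid on a plane tangent to the sphere onto equirectangular pixel coordinates, showing that a patch of fixed angular size covers more pixels horizontally at high latitude than at the equator.

```python
import math

def tangent_patch_to_equirect(lat0, lon0, fov, n, width, height):
    """Map an n x n tangent-plane grid centered at (lat0, lon0) to
    equirectangular pixel coordinates (u, v). Angles are in radians.
    Hypothetical helper for illustration; not taken from the paper."""
    if n < 2:
        raise ValueError("n must be at least 2")
    half = math.tan(fov / 2.0)  # half-extent of the patch on the tangent plane
    grid = []
    for i in range(n):
        y = half * (1.0 - 2.0 * i / (n - 1))  # row 0 is the northern edge
        row = []
        for j in range(n):
            x = half * (2.0 * j / (n - 1) - 1.0)  # column 0 is the western edge
            rho = math.hypot(x, y)
            if rho == 0.0:
                lat, lon = lat0, lon0  # patch center maps to itself
            else:
                # Inverse gnomonic projection (tangent plane -> sphere).
                c = math.atan(rho)
                s = (math.cos(c) * math.sin(lat0)
                     + y * math.sin(c) * math.cos(lat0) / rho)
                lat = math.asin(max(-1.0, min(1.0, s)))
                lon = lon0 + math.atan2(
                    x * math.sin(c),
                    rho * math.cos(lat0) * math.cos(c)
                    - y * math.sin(lat0) * math.sin(c),
                )
            # Wrap longitude into [-pi, pi) so patches can cross the seam.
            lon = (lon + math.pi) % (2.0 * math.pi) - math.pi
            u = (lon / (2.0 * math.pi) + 0.5) * width
            v = (0.5 - lat / math.pi) * height
            row.append((u, v))
        grid.append(row)
    return grid

# Same 30-degree patch at the equator and at 60 degrees north.
equator = tangent_patch_to_equirect(0.0, 0.0, math.radians(30), 3, 1920, 960)
north = tangent_patch_to_equirect(math.radians(60), 0.0, math.radians(30), 3, 1920, 960)

# Horizontal pixel extent of the middle row of each patch.
span_equator = equator[1][2][0] - equator[1][0][0]
span_north = north[1][2][0] - north[1][0][0]
print(span_equator, span_north)
```

In this sketch the middle row of the patch at 60° north spans more pixels than the same patch at the equator, which is the kind of latitude-dependent stretching that a fixed square patch grid on the projected frame does not account for.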

Code & Models

Repositories

hs-yn/paver (PyTorch, official)

Taxonomy

Topics: Visual Attention and Saliency Detection · Image and Video Quality Assessment · Virtual Reality Applications and Impacts

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Softmax · Dropout · Vision Transformer · Residual Connection