Panoramic Vision Transformer for Saliency Detection in 360° Videos
Heeseung Yun, Sehun Lee, Gunhee Kim

TL;DR
The paper introduces PAVER, a panoramic vision transformer framework for saliency detection in 360° videos. It handles the distortion and discontinuity introduced by projecting the sphere onto a flat frame, and it outperforms state-of-the-art models on the Wild360 benchmark without supervision.
Contribution
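PAVER builds its encoder from a Vision Transformer with deformable convolution. This lets models pretrained on normal videos be plugged in without additional modules or finetuning, and it requires the geometric approximation to be performed only once, unlike previous deep CNN-based approaches.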
Findings
Outperforms state-of-the-art on Wild360 benchmark
Operates without supervision or auxiliary data
Improves omnidirectional video quality assessment results
Abstract
360° video saliency detection is one of the challenging benchmarks for 360° video understanding since non-negligible distortion and discontinuity occur in the projection of any format of 360° videos, and capture-worthy viewpoint in the omnidirectional sphere is ambiguous by nature. We present a new framework named Panoramic Vision Transformer (PAVER). We design the encoder using Vision Transformer with deformable convolution, which enables us not only to plug pretrained models from normal videos into our architecture without additional modules or finetuning but also to perform geometric approximation only once, unlike previous deep CNN-based approaches. Thanks to its powerful encoder, PAVER can learn the saliency from three simple relative relations among local patch features, outperforming state-of-the-art models for the Wild360 benchmark by large margins without…
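The projection distortion mentioned in the abstract can be made concrete with a small sketch. The code below is not the authors' implementation. It assumes an equirectangular frame and uses the standard inverse gnomonic projection to find where a square patch on the plane tangent to the sphere lands in the frame, which is the kind of sampling position a deformable convolution could be pointed at. The function name, its parameters, and the frame size are hypothetical and chosen only for illustration.

```python
import math


def tangent_patch_to_equirect(lon0, lat0, half_fov, k, width, height):
    """Hypothetical helper: return k*k (col, row) positions in an
    equirectangular frame for a square patch on the plane tangent to the
    unit sphere at longitude lon0 and latitude lat0 (radians). k must be >= 2."""
    extent = math.tan(half_fov)  # half side length of the patch on the tangent plane
    points = []
    for i in range(k):
        for j in range(k):
            # Plane coordinates in [-extent, extent]; y grows toward the north pole.
            x = extent * (2 * j / (k - 1) - 1)
            y = extent * (1 - 2 * i / (k - 1))
            rho = math.hypot(x, y)
            if rho == 0.0:
                lon, lat = lon0, lat0  # the tangent point maps to itself
            else:
                # Inverse gnomonic projection (plane -> sphere).
                c = math.atan(rho)
                s = math.cos(c) * math.sin(lat0) + y * math.sin(c) * math.cos(lat0) / rho
                lat = math.asin(max(-1.0, min(1.0, s)))  # clamp against rounding error
                lon = lon0 + math.atan2(
                    x * math.sin(c),
                    rho * math.cos(lat0) * math.cos(c) - y * math.sin(lat0) * math.sin(c),
                )
            # Wrap longitude into [-pi, pi) so the patch can cross the frame border.
            lon = (lon + math.pi) % (2 * math.pi) - math.pi
            # Sphere -> equirectangular pixel coordinates.
            col = (lon / (2 * math.pi) + 0.5) * width
            row = (0.5 - lat / math.pi) * height
            points.append((col, row))
    return points


# Same patch size on the sphere, placed at the equator and at 60 degrees latitude.
equator = tangent_patch_to_equirect(0.0, 0.0, 0.1, 3, 1024, 512)
high_lat = tangent_patch_to_equirect(0.0, math.pi / 3, 0.1, 3, 1024, 512)
print(equator[5][0] - equator[3][0])    # horizontal pixel span of the middle row
print(high_lat[5][0] - high_lat[3][0])  # noticeably wider span near the pole
```

In this sketch the same patch covers roughly twice as many columns at 60° latitude as at the equator, which illustrates why a fixed square kernel is a poor fit for equirectangular frames. The sampling positions depend only on geometry, so they can be computed once and reused for every frame.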
Taxonomy
Topics: Visual Attention and Saliency Detection · Image and Video Quality Assessment · Virtual Reality Applications and Impacts
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Softmax · Dropout · Vision Transformer · Residual Connection
