Listen to the Unexpected: Self-Supervised Surprise Detection for Efficient Viewport Prediction
Arman Nik Khah, Ravi Prakash

TL;DR
This paper introduces a self-supervised surprise detection framework using spatial audio cues to improve viewport prediction in 360-degree video streaming, reducing bandwidth waste.
Contribution
It presents a novel self-learning approach combining graph neural networks and temporal modeling to detect auditory surprises for enhanced viewport prediction.
Findings
Reduces bitrate waste by up to 18% with audio surprise integration.
Demonstrates effectiveness of auditory cues in viewport prediction.
Validates approach on AVTrack360 dataset.
Abstract
Adaptive streaming of 360-degree video relies on viewport prediction to allocate bandwidth efficiently. Current approaches predominantly use visual saliency or historical gaze patterns, neglecting the role of spatial audio in guiding user attention. This paper presents a self-learning framework for detecting "surprising" auditory events -- moments that deviate from learned temporal expectations -- and demonstrates their utility for viewport prediction. The proposed architecture combines equivariant graph neural networks with recurrent temporal modeling, trained via a dual self-supervised objective. A key feature is the natural modeling of temporal attention decay: surprise is high at event onset but diminishes as the listener adapts. Experiments on the AVTrack360 dataset show that integrating audio surprise with visual cues reduces bitrate waste by up to 18% compared to…
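The idea of surprise as deviation from a learned temporal expectation, with decay as the listener adapts, can be illustrated with a toy sketch. This is not the paper's architecture (which uses graph neural networks and recurrent modeling); it assumes a simple exponential moving average as the "expectation", and the names `surprise_trace`, `adapt_rate`, and `energy` are hypothetical, chosen only for illustration.

```python
import numpy as np


def surprise_trace(signal, adapt_rate=0.3):
    """Toy surprise score: absolute gap between each observation and a
    running expectation (exponential moving average).

    Assumption: the EMA stands in for a learned temporal model; the
    paper's actual predictor is not reproduced here.
    """
    signal = np.asarray(signal, dtype=float)
    expectation = signal[0]              # start with no surprise at t = 0
    surprise = np.zeros_like(signal)
    for t, x in enumerate(signal):
        surprise[t] = abs(x - expectation)           # deviation from expectation
        expectation += adapt_rate * (x - expectation)  # listener adapts
    return surprise


# Hypothetical audio energy: silence, then a sustained sound starting at t = 5.
energy = np.concatenate([np.zeros(5), np.ones(10)])
trace = surprise_trace(energy)
```

In this sketch the score peaks at the onset of the sound and then shrinks geometrically while the sound continues, mirroring the attention-decay behavior the abstract describes. A larger `adapt_rate` makes the simulated listener habituate faster.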
Taxonomy
Topics: Visual Attention and Saliency Detection · Image and Video Quality Assessment · Video Analysis and Summarization
