TinyHD: Efficient Video Saliency Prediction with Heterogeneous Decoders using Hierarchical Maps Distillation
Feiyan Hu, Simone Palazzo, Federica Proietto Salanitri, Giovanni Bellitto, Morteza Moradi, Concetto Spampinato, Kevin McGuinness

TL;DR
TinyHD introduces a lightweight, efficient video saliency prediction model using heterogeneous decoders and hierarchical knowledge distillation, achieving state-of-the-art accuracy with reduced computational costs.
Contribution
The paper presents a novel lightweight architecture with multiple decoders and hierarchical distillation techniques for efficient video saliency prediction.
Findings
Achieves accuracy comparable to or better than state-of-the-art methods.
Significantly reduces computational costs.
Makes effective use of hierarchical multi-map knowledge distillation.
Abstract
Video saliency prediction has recently attracted the attention of the research community, as it is an upstream task for several practical applications. However, current solutions are particularly computationally demanding, especially due to the wide usage of spatio-temporal 3D convolutions. We observe that, while different model architectures achieve similar performance on benchmarks, visual variations between predicted saliency maps are still significant. Inspired by this intuition, we propose a lightweight model that employs multiple simple heterogeneous decoders and adopts several practical approaches to improve accuracy while keeping computational costs low, such as hierarchical multi-map knowledge distillation, multi-output saliency prediction, unlabeled auxiliary datasets and channel reduction with teacher assistant supervision. Our approach achieves saliency prediction accuracy on…
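The abstract names hierarchical multi-map knowledge distillation but does not spell out the loss. As a rough illustration only, the sketch below assumes a student is trained to match several teacher saliency maps (one per hierarchy level) using a weighted sum of KL divergences, a measure commonly used for comparing saliency maps. The function names, the choice of KL divergence, and the per-level weights are assumptions made for this example, not details taken from the paper.

```python
import math


def normalize(saliency_map, eps=1e-8):
    """Flatten a 2D map of non-negative values into a probability distribution."""
    flat = [value for row in saliency_map for value in row]
    total = sum(flat) + eps
    return [value / total for value in flat]


def kl_divergence(target_map, predicted_map, eps=1e-8):
    """KL(target || predicted) between two saliency maps of the same size."""
    p = normalize(target_map, eps)
    q = normalize(predicted_map, eps)
    return sum(pi * math.log(pi / (qi + eps) + eps) for pi, qi in zip(p, q))


def multi_map_distillation_loss(teacher_maps, student_maps, weights=None):
    """Weighted sum of per-level KL terms (hypothetical helper).

    teacher_maps and student_maps are lists with one 2D map per hierarchy
    level; the pairing of levels and the weights are assumptions.
    """
    if len(teacher_maps) != len(student_maps):
        raise ValueError("teacher and student must provide the same number of maps")
    if weights is None:
        weights = [1.0] * len(teacher_maps)
    return sum(
        w * kl_divergence(t, s)
        for w, t, s in zip(weights, teacher_maps, student_maps)
    )


# Example: two hierarchy levels, with the student matching only the first.
teacher = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
student = [[[1.0, 0.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]
print(multi_map_distillation_loss(teacher, student, weights=[0.5, 0.5]))
```

In a real training setup the same idea would be expressed with differentiable tensor operations, and the distillation term would typically be combined with a supervised loss on ground-truth saliency where labels are available.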
Taxonomy
Topics: Visual Attention and Saliency Detection · Image and Video Quality Assessment
