MDS-ViTNet: Improving saliency prediction for Eye-Tracking with Vision Transformer
Polezhaev Ignat, Goncharenko Igor, Iurina Natalya

TL;DR
This paper introduces MDS-ViTNet, a novel Vision Transformer-based network for saliency prediction in eye-tracking that achieves state-of-the-art results and is applicable across various fields.
Contribution
The paper presents a new encoder-decoder architecture that pairs a Swin Vision Transformer encoder with dual CNN decoders for saliency prediction, reporting results that surpass previous methods.
Findings
Achieves state-of-the-art performance on multiple benchmarks.
Utilizes a novel multi-decoder approach for saliency map generation.
Demonstrates the effectiveness of Vision Transformers in eye-tracking applications.
Abstract
In this paper, we present a novel methodology we call MDS-ViTNet (Multi Decoder Saliency by Vision Transformer Network) for enhancing visual saliency prediction or eye-tracking. This approach holds significant potential for diverse fields, including marketing, medicine, robotics, and retail. We propose a network architecture that leverages the Vision Transformer, moving beyond the conventional ImageNet backbone. The framework adopts an encoder-decoder structure, with the encoder utilizing a Swin transformer to efficiently embed the most important features. This process involves a Transfer Learning method, wherein layers from the Vision Transformer are converted by the Encoder Transformer and seamlessly integrated into a CNN Decoder. This methodology ensures minimal information loss from the original input image. The decoder employs a multi-decoding technique, utilizing dual decoders to…
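The abstract describes features from a Swin transformer encoder being converted and handed to a CNN decoder. The following is a minimal NumPy sketch of that kind of hand-off, not the authors' implementation. It assumes a Swin-style four-stage hierarchy (strides 4, 8, 16, 32 with channel doubling), row-major token order, and a simple U-Net-style decoder built from nearest-neighbour upsampling, 1x1 projections, additive skip connections, and a sigmoid output. All function names are hypothetical, the weights are random, and the dual-decoder step is not modelled because its details are cut off in the abstract.

```python
import numpy as np


def tokens_to_feature_map(tokens, height, width):
    """Reshape transformer tokens (B, H*W, C) into a CNN-style map (B, C, H, W).

    Assumes tokens are stored in row-major (raster) order.
    """
    b, n, c = tokens.shape
    if n != height * width:
        raise ValueError("token count must equal height * width")
    return tokens.reshape(b, height, width, c).transpose(0, 3, 1, 2)


def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (B, C, H, W) array."""
    return x.repeat(2, axis=2).repeat(2, axis=3)


def conv1x1(x, weight):
    """1x1 convolution as channel mixing; weight has shape (C_out, C_in)."""
    return np.einsum("oc,bchw->bohw", weight, x)


def make_dummy_swin_stages(rng, batch, height, width, base_channels):
    """Hypothetical stand-in for a Swin encoder: random tokens at strides 4, 8, 16, 32."""
    stages = []
    for i in range(4):
        stride = 4 * 2 ** i
        h, w, c = height // stride, width // stride, base_channels * 2 ** i
        tokens = rng.standard_normal((batch, h * w, c))
        stages.append(tokens_to_feature_map(tokens, h, w))
    return stages  # ordered fine -> coarse


def cnn_decoder(feature_maps, rng):
    """Hypothetical U-Net-style decoder producing a saliency map in (0, 1).

    Starts at the coarsest stage, repeatedly upsamples, projects to the next
    finer stage's channel count, adds that stage as a skip connection, and
    applies ReLU. A final 1x1 projection and two 2x upsamplings restore the
    input resolution before the sigmoid.
    """
    x = feature_maps[-1]
    for skip in reversed(feature_maps[:-1]):
        x = upsample2x(x)
        w = rng.standard_normal((skip.shape[1], x.shape[1])) * 0.1  # random weights
        x = np.maximum(conv1x1(x, w) + skip, 0.0)
    w_out = rng.standard_normal((1, x.shape[1])) * 0.1
    logits = upsample2x(upsample2x(conv1x1(x, w_out)))  # stride 4 -> full resolution
    return 1.0 / (1.0 + np.exp(-logits))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    stages = make_dummy_swin_stages(rng, batch=2, height=64, width=64, base_channels=8)
    saliency = cnn_decoder(stages, rng)
    print(saliency.shape)  # (2, 1, 64, 64)
```

The reshape in `tokens_to_feature_map` is the step that lets transformer outputs feed ordinary convolutional layers, and the additive skips reuse fine-grained encoder stages so that spatial detail from the input is not discarded.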
Taxonomy
Topics: Visual Attention and Saliency Detection
Methods: Attention Is All You Need · Byte Pair Encoding · Label Smoothing · Adam · Position-Wise Feed-Forward Layer · Dropout · Dense Connections · Absolute Position Encodings · Softmax · Layer Normalization
