MDS-ViTNet: Improving saliency prediction for Eye-Tracking with Vision Transformer

Polezhaev Ignat; Goncharenko Igor; Iurina Natalya

arXiv:2405.19501·cs.CV·May 31, 2024

TL;DR

This paper introduces MDS-ViTNet, a Vision Transformer-based network for saliency prediction in eye-tracking that reports state-of-the-art results, with potential applications in marketing, medicine, robotics, and retail.

Contribution

The paper presents a new encoder-decoder architecture that pairs a Vision Transformer encoder with dual decoders for saliency prediction and reports results surpassing previous methods.

Findings

01 Achieves state-of-the-art performance on multiple benchmarks.

02 Utilizes a novel multi-decoder approach for saliency map generation.

03 Demonstrates effectiveness of Vision Transformer in eye-tracking applications.

Abstract

In this paper, we present a novel methodology we call MDS-ViTNet (Multi Decoder Saliency by Vision Transformer Network) for enhancing visual saliency prediction, or eye-tracking. This approach holds significant potential for diverse fields, including marketing, medicine, robotics, and retail. We propose a network architecture that leverages the Vision Transformer, moving beyond the conventional ImageNet backbone. The framework adopts an encoder-decoder structure, with the encoder utilizing a Swin transformer to efficiently embed the most important features. This process involves a transfer learning method, wherein layers from the Vision Transformer are converted by the Encoder Transformer and seamlessly integrated into a CNN decoder. This methodology ensures minimal information loss from the original input image. The decoder employs a multi-decoding technique, utilizing dual decoders to…
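The encoder-decoder layout described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the class names `PatchEncoder`, `ConvDecoder`, and `DualDecoderSaliency` are hypothetical, all layer sizes are arbitrary, and a small patch-embedding plus one generic transformer layer stands in for the Swin transformer. Because the abstract is truncated before it says how the two decoder outputs are used, the sketch simply returns both maps and does not combine them.

```python
import torch
import torch.nn as nn


class PatchEncoder(nn.Module):
    """Stand-in encoder (NOT a Swin transformer): patch embedding + one transformer layer."""

    def __init__(self, dim=32, patch=8):
        super().__init__()
        # Non-overlapping patches are projected to `dim` channels.
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, dim_feedforward=64, batch_first=True
        )

    def forward(self, x):
        f = self.embed(x)                      # (B, dim, H/patch, W/patch)
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)  # (B, h*w, dim) token sequence
        tokens = self.block(tokens)
        # Convert the token sequence back into a feature map for the CNN decoders.
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class ConvDecoder(nn.Module):
    """Small CNN decoder that upsamples features to a one-channel map in [0, 1]."""

    def __init__(self, dim=32, patch=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Upsample(scale_factor=patch, mode="bilinear", align_corners=False),
            nn.Conv2d(16, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, features):
        return self.net(features)


class DualDecoderSaliency(nn.Module):
    """One shared encoder feeding two independent decoders (assumed layout)."""

    def __init__(self):
        super().__init__()
        self.encoder = PatchEncoder()
        self.decoder_a = ConvDecoder()
        self.decoder_b = ConvDecoder()

    def forward(self, x):
        features = self.encoder(x)
        # Both maps are returned as-is; how they are used downstream is not
        # stated in the truncated abstract, so no merging step is shown.
        return self.decoder_a(features), self.decoder_b(features)


model = DualDecoderSaliency().eval()
with torch.no_grad():
    map_a, map_b = model(torch.rand(1, 3, 64, 64))
print(map_a.shape, map_b.shape)
```

Each decoder receives the same encoder features but has its own weights, so the two predicted maps generally differ even for the same input image.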

Code & Models

Repositories

ignatpolezhaev/mds-vitnet (PyTorch, official)

Taxonomy

Topics: Visual Attention and Saliency Detection

Methods: Attention Is All You Need · Byte Pair Encoding · Label Smoothing · Adam · Position-Wise Feed-Forward Layer · Dropout · Dense Connections · Absolute Position Encodings · Softmax · Layer Normalization