Tile Classification Based Viewport Prediction with Multi-modal Fusion Transformer
Zhihao Zhang, Yiwei Chen, Weizhan Zhang, Caixia Yan, Qinghua Zheng, Qi Wang, Wangdu Chen

TL;DR
This paper introduces MFTR, a transformer-based method for viewport prediction in 360 videos that improves robustness and interpretability by classifying each tile as user-interested or not, using multi-modal inputs (user history and video content).
Contribution
The paper proposes a novel multi-modal fusion transformer approach for tile classification in viewport prediction, enhancing robustness and interpretability over trajectory-based methods.
Findings
MFTR outperforms state-of-the-art methods in accuracy and overlap ratio.
MFTR demonstrates better robustness and interpretability than trajectory-based methods.
The method achieves competitive computational efficiency.
Abstract
Viewport prediction is a crucial aspect of tile-based 360 video streaming systems. However, existing trajectory-based methods lack robustness and oversimplify the construction and fusion of information from different modality inputs, leading to error accumulation. In this paper, we propose a tile-classification-based viewport prediction method with a Multi-modal Fusion Transformer, namely MFTR. Specifically, MFTR uses transformer-based networks to extract the long-range dependencies within each modality, then mines intra- and inter-modality relations to capture the combined impact of user historical inputs and video content on future viewport selection. In addition, MFTR classifies future tiles into two categories, user-interested or not, and selects the future viewport as the region that contains the most user-interested tiles. Comparing with predicting head…
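The viewport selection step described in the abstract (classify tiles, then pick the region containing the most user-interested tiles) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name select_viewport, the 0.5 threshold, the rectangular viewport measured in tiles, the horizontal wrap-around, and the first-match tie-breaking are all assumptions made for illustration. The per-tile scores stand in for whatever the classifier outputs.

```python
from typing import List, Tuple


def select_viewport(
    scores: List[List[float]],
    vp_rows: int,
    vp_cols: int,
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Tuple[int, int, int]:
    """Return (top, left, count) of the window with the most interested tiles.

    scores[r][c] is a hypothetical per-tile "user interested" score.
    Columns wrap around, assuming an equirectangular tile grid whose left
    and right edges are adjacent on the sphere. Rows do not wrap.
    """
    rows, cols = len(scores), len(scores[0])
    if not (0 < vp_rows <= rows and 0 < vp_cols <= cols):
        raise ValueError("viewport must fit inside the tile grid")

    # Step 1: binary tile classification (interested = 1, not interested = 0).
    interested = [[1 if s >= threshold else 0 for s in row] for row in scores]

    # Step 2: slide the viewport over every position and count interested tiles.
    best_count, best_top, best_left = -1, 0, 0
    for top in range(rows - vp_rows + 1):
        for left in range(cols):
            count = sum(
                interested[top + dr][(left + dc) % cols]
                for dr in range(vp_rows)
                for dc in range(vp_cols)
            )
            # Strict comparison keeps the first window found on ties.
            if count > best_count:
                best_count, best_top, best_left = count, top, left
    return best_top, best_left, best_count


# Example: a 3x4 tile grid where the interesting content straddles the seam.
scores = [
    [0.9, 0.1, 0.1, 0.8],
    [0.7, 0.2, 0.1, 0.9],
    [0.1, 0.1, 0.1, 0.1],
]
print(select_viewport(scores, vp_rows=2, vp_cols=2))  # (0, 3, 4)
```

In this example the selected 2x2 window starts at column 3 and wraps to column 0, covering all four interested tiles. Because the decision is a count over tile labels rather than a single predicted head position, a few misclassified tiles shift the count only slightly, which is one way to read the robustness claim.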
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Adam · Layer Normalization · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Dense Connections
