Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation
Philipp Harzig, Moritz Einfalt, Rainer Lienhart

TL;DR
This paper introduces a novel Fractional Positional Encoding method for Transformers, enhancing video-to-text translation by better synchronizing audio-visual features, leading to state-of-the-art results on multiple datasets.
Contribution
It proposes a new Fractional Positional Encoding technique for Transformers, improving audio-visual feature synchronization in video-to-text tasks.
Findings
FPE increases the CIDEr score by a relative 8.6%
Achieves state-of-the-art results on MSR-VTT and MSVD datasets
Improves CIDEr and BLEU-4 scores significantly over vanilla Transformer
Abstract
Video-to-Text (VTT) is the task of automatically generating descriptions for short audio-visual video clips, which can, for instance, help visually impaired people understand the scenes of a YouTube video. Transformer architectures have shown great performance in both machine translation and image captioning, but they lack a straightforward and reproducible application to VTT. Moreover, there is no comprehensive study of strategies and advice for video description generation, including how to exploit the accompanying audio with fully self-attentive networks. Thus, we explore promising approaches from image captioning and video processing and apply them to VTT by developing a straightforward Transformer architecture. Additionally, we present a novel way of synchronizing audio and video features in Transformers, which we call Fractional Positional Encoding (FPE). We run multiple experiments…
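The abstract names FPE but does not spell out its mechanics, so the following is only a minimal sketch of one plausible reading. It assumes that video frames keep integer positions, that audio frames are mapped onto the video timeline as fractional positions, and that a standard sinusoidal encoding is evaluated at those real-valued positions. The function names, the frame counts, and the linear rescaling rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def sinusoidal_encoding(positions, d_model):
    """Sinusoidal positional encoding evaluated at real-valued positions.

    Assumes d_model is even. Positions may be fractional.
    """
    pos = np.asarray(positions, dtype=np.float64)[:, None]      # (n, 1)
    dims = np.arange(0, d_model, 2, dtype=np.float64)[None, :]  # (1, d_model / 2)
    angles = pos / np.power(10000.0, dims / d_model)            # (n, d_model / 2)
    enc = np.zeros((pos.shape[0], d_model))
    enc[:, 0::2] = np.sin(angles)  # even dimensions
    enc[:, 1::2] = np.cos(angles)  # odd dimensions
    return enc


def fractional_positions(n_video, n_audio):
    """Hypothetical rule: place n_audio frames on a timeline of n_video frames.

    Audio frame j gets position j * n_video / n_audio, so audio and video
    frames covering the same instant share the same position.
    """
    return np.arange(n_audio) * (n_video / n_audio)


# Illustrative setup: 4 video frames and 8 audio frames over the same clip.
n_video, n_audio, d_model = 4, 8, 16
video_pe = sinusoidal_encoding(np.arange(n_video), d_model)
audio_pos = fractional_positions(n_video, n_audio)
audio_pe = sinusoidal_encoding(audio_pos, d_model)
```

Under these assumptions, audio frame 2 sits at position 1.0 and therefore receives the same encoding as video frame 1, while audio frame 1 sits at position 0.5, halfway between the first two video frames. Encodings of this kind would be added to the respective feature sequences before they enter the Transformer.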
Taxonomy
Topics: Music and Audio Processing · Subtitles and Audiovisual Media · Cancer-related molecular mechanisms research
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Absolute Position Encodings · Residual Connection · Dropout · Softmax · Byte Pair Encoding · Dense Connections
