Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation

Philipp Harzig; Moritz Einfalt; Rainer Lienhart

arXiv:2112.14088·cs.CV·December 30, 2021

TL;DR

This paper introduces a novel Fractional Positional Encoding method for Transformers, enhancing video-to-text translation by better synchronizing audio-visual features, leading to state-of-the-art results on multiple datasets.

Contribution

It proposes a new Fractional Positional Encoding technique for Transformers, improving audio-visual feature synchronization in video-to-text tasks.

Findings

01. FPE increases the CIDEr score by 8.6%.

02. Achieves state-of-the-art results on the MSR-VTT and MSVD datasets.

03. Improves CIDEr and BLEU-4 scores significantly over a vanilla Transformer.

Abstract

Video-to-Text (VTT) is the task of automatically generating descriptions for short audio-visual video clips, which can, for instance, help visually impaired people understand the scenes of a YouTube video. Transformer architectures have shown great performance in both machine translation and image captioning, but they lack a straightforward and reproducible application to VTT. Moreover, there is no comprehensive study of strategies and advice for video description generation, including how to exploit the accompanying audio with fully self-attentive networks. Thus, we explore promising approaches from image captioning and video processing and apply them to VTT by developing a straightforward Transformer architecture. Additionally, we present a novel way of synchronizing audio and video features in Transformers, which we call Fractional Positional Encoding (FPE). We run multiple experiments…
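The abstract names FPE but does not give its formula here. The sketch below shows one plausible reading of the idea, not the authors' implementation. It assumes that audio and video frames span the same clip duration, that audio frames are placed at non-integer positions on the video frame axis, and that the standard sinusoidal positional encoding is evaluated at those positions. The function names `sinusoidal_encoding` and `fractional_positions` are hypothetical, chosen only for this illustration.

```python
import math


def sinusoidal_encoding(position, d_model):
    """Standard sinusoidal positional encoding at a (possibly fractional) position."""
    enc = []
    for i in range(0, d_model, 2):
        # i is the even dimension index, so the exponent is i / d_model.
        angle = position / (10000 ** (i / d_model))
        enc.append(math.sin(angle))
        enc.append(math.cos(angle))
    return enc[:d_model]


def fractional_positions(num_video, num_audio):
    """Map audio frame indices onto the video frame index axis.

    Assumption: both streams cover the same clip duration, so audio frame j
    lands at video position j * num_video / num_audio.
    """
    scale = num_video / num_audio
    return [j * scale for j in range(num_audio)]


# Toy clip: 4 video frames and 8 audio frames over the same time span.
video_pos = list(range(4))
audio_pos = fractional_positions(4, 8)

video_enc = [sinusoidal_encoding(p, 8) for p in video_pos]
audio_enc = [sinusoidal_encoding(p, 8) for p in audio_pos]
```

Under these assumptions, an audio frame that coincides in time with a video frame receives the same encoding as that video frame, while audio frames that fall between two video frames receive an encoding at an intermediate, fractional position.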

Taxonomy

Topics: Music and Audio Processing · Subtitles and Audiovisual Media · Cancer-related molecular mechanisms research

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Absolute Position Encodings · Residual Connection · Dropout · Softmax · Byte Pair Encoding · Dense Connections