Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers

Chiori Hori; Takaaki Hori; Jonathan Le Roux

arXiv:2108.02147·cs.CV·August 5, 2021

Open Access

TL;DR

This paper introduces a novel audio-visual Transformer-based method for low-latency online video captioning, enabling early caption generation with high quality by optimizing timing based on event detection and partial video frames.

Contribution

It proposes jointly training an audio-visual Transformer and a timing detector so that accurate captions can be produced early in a video stream, reducing captioning latency.

Findings

01 Achieves 94% of the caption quality obtained with the full video, using only the first 28% of frames

02 Enables early captioning triggered by event detection or prediction

03 Outperforms traditional methods in the latency-accuracy trade-off

Abstract

Video captioning is an essential technology to understand scenes and describe events in natural language. To apply it to real-time monitoring, a system needs not only to describe events accurately but also to produce the captions as soon as possible. Low-latency captioning is needed to realize such functionality, but this research area for online video captioning has not been pursued yet. This paper proposes a novel approach to optimize each caption's output timing based on a trade-off between latency and caption quality. An audio-visual Transformer is trained to generate ground-truth captions using only a small portion of all video frames, and to mimic outputs of a pre-trained Transformer to which all the frames are given. A CNN-based timing detector is also trained to detect a proper output timing, where the captions generated by the two Transformers become sufficiently close to…
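The output-timing idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a hypothetical detector that yields one score in [0, 1] per processed chunk of frames, and a hypothetical fixed threshold above which the caption is emitted. The function names and the threshold value are illustrative only.

```python
from typing import Sequence


def first_emit_step(scores: Sequence[float], threshold: float = 0.5) -> int:
    """Return the 0-based step at which a caption would be emitted.

    `scores` are hypothetical per-step timing-detector outputs in [0, 1].
    The caption is emitted at the first step whose score reaches
    `threshold`; if none does, it is emitted at the final step.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    for step, score in enumerate(scores):
        if score >= threshold:
            return step
    return len(scores) - 1


def latency_ratio(scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of the video that was consumed before the caption is emitted."""
    step = first_emit_step(scores, threshold)
    return (step + 1) / len(scores)


if __name__ == "__main__":
    # Made-up detector scores for a clip split into four chunks.
    example_scores = [0.1, 0.3, 0.8, 0.9]
    print(first_emit_step(example_scores))
    print(latency_ratio(example_scores))
```

Raising the threshold in this sketch delays emission and lowers the risk of an early, inaccurate caption. Lowering it does the opposite, which is the latency-quality trade-off the abstract refers to.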

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Dropout · Dense Connections · Adam · Label Smoothing · Residual Connection