SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning

Kevin Lin; Linjie Li; Chung-Ching Lin; Faisal Ahmed; Zhe Gan; Zicheng Liu; Yumao Lu; Lijuan Wang

arXiv:2111.13196 · cs.CV · June 22, 2022 · 31 cites

TL;DR

SwinBERT introduces an end-to-end transformer model for video captioning that directly processes video frame patches, utilizing sparse attention to improve long-range sequence modeling and outperform previous methods across multiple datasets.

Contribution

The paper presents SwinBERT, an end-to-end transformer architecture for video captioning with adaptive sparse attention, which removes the need for separate offline feature extractors.

Findings

01 Significant performance improvements over previous methods on 5 datasets.

02 Dense sampling of video frames benefits captioning accuracy.

03 Sparse attention masks improve long-range video understanding.
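
As a rough illustration of the third finding, the sketch below gates attention weights among video tokens with a soft mask and computes a penalty that encourages the mask to be sparse. Everything in it is an assumption made for illustration: the function names `masked_attention` and `sparsity_penalty`, the sigmoid gating, the row renormalization, and the mean-gate penalty are hypothetical choices, not details stated on this page. The paper and the microsoft/swinbert repository describe the actual method.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention(scores, mask_logits):
    """Gate softmax attention weights with a soft mask in (0, 1),
    then renormalize each row so it sums to 1.
    Assumes moderate logits so every row keeps a nonzero gate."""
    out = []
    for score_row, logit_row in zip(scores, mask_logits):
        weights = softmax(score_row)
        gated = [w * sigmoid(l) for w, l in zip(weights, logit_row)]
        total = sum(gated)
        out.append([g / total for g in gated])
    return out

def sparsity_penalty(mask_logits):
    """Mean gate value; adding it to a training loss pushes gates toward 0."""
    gates = [sigmoid(l) for row in mask_logits for l in row]
    return sum(gates) / len(gates)

# Three video tokens with uniform raw scores (hypothetical values).
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
# Large negative logits switch off attention to the middle token.
mask_logits = [[8.0, -8.0, 8.0] for _ in range(3)]

attn = masked_attention(scores, mask_logits)
penalty = sparsity_penalty(mask_logits)
```

In this toy setup each token attends almost entirely to the first and third tokens, and the penalty falls as more mask logits become negative.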

Abstract

The canonical approach to video captioning dictates a caption generation model to learn from offline-extracted dense video features. These feature extractors usually operate on video frames sampled at a fixed frame rate and are often trained on image/video understanding tasks, without adaption to video captioning data. In this work, we present SwinBERT, an end-to-end transformer-based model for video captioning, which takes video frame patches directly as inputs, and outputs a natural language description. Instead of leveraging multiple 2D/3D feature extractors, our method adopts a video transformer to encode spatial-temporal representations that can adapt to variable lengths of video input without dedicated design for different frame rates. Based on this model architecture, we show that video captioning can benefit significantly from more densely sampled video frames as opposed to…
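
The contrast between sparse and dense frame sampling mentioned in the abstract can be pictured with a minimal sketch. The helper `sample_frame_indices` and the frame counts below are hypothetical and chosen only for illustration; this page does not state how the authors sample frames.

```python
def sample_frame_indices(num_video_frames, num_samples):
    """Pick indices spread evenly across a video by taking the midpoint
    of each of num_samples equal-length segments."""
    if num_video_frames <= 0 or num_samples <= 0:
        raise ValueError("both arguments must be positive")
    # Never request more frames than the video contains.
    num_samples = min(num_samples, num_video_frames)
    step = num_video_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]

# A 300-frame clip sampled sparsely and densely (hypothetical sizes).
sparse = sample_frame_indices(300, 4)   # [37, 112, 187, 262]
dense = sample_frame_indices(300, 64)
```

Because the encoder described in the abstract accepts variable-length input, the same model could in principle consume either index list; only the number of frames passed in changes.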

Code & Models

Repositories

microsoft/swinbert
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning