SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
Kevin Lin, Linjie Li, Chung-Ching Lin, Faisal Ahmed, Zhe Gan, Zicheng Liu, Yumao Lu, Lijuan Wang

TL;DR
SwinBERT introduces an end-to-end transformer model for video captioning that directly processes video frame patches, utilizing sparse attention to improve long-range sequence modeling and outperform previous methods across multiple datasets.
Contribution
The paper presents SwinBERT, a novel end-to-end transformer architecture with adaptive sparse attention for video captioning, eliminating the need for separate feature extractors and enhancing performance.
Findings
Significant performance improvements over previous methods on 5 datasets.
Dense sampling of video frames benefits captioning accuracy.
Sparse attention masks improve long-range video understanding.
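To make the sparse attention finding concrete, here is a minimal sketch of a soft attention mask with a sparsity penalty. It is an illustration under stated assumptions, not the authors' implementation: the function names, the sigmoid parameterization, the multiply-then-renormalize gating, and the L1 penalty weight are all hypothetical choices made for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def soft_mask(params):
    """Map unconstrained mask parameters to soft mask values in (0, 1)."""
    return [[sigmoid(p) for p in row] for row in params]

def sparsity_penalty(mask, weight=1.0):
    """L1-style penalty that pushes mask entries toward zero (assumed form)."""
    return weight * sum(abs(m) for row in mask for m in row)

def softmax(row):
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention(scores, mask):
    """Gate softmax attention weights by the mask, then renormalize each row.

    This gating scheme is an assumption made for illustration.
    """
    weights = []
    for score_row, mask_row in zip(scores, mask):
        gated = [w * m for w, m in zip(softmax(score_row), mask_row)]
        total = sum(gated)
        weights.append([g / total for g in gated])
    return weights

# Three video tokens with hypothetical mask parameters and uniform scores.
params = [[4.0, -4.0, 0.0], [-4.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
mask = soft_mask(params)
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
attn = masked_attention(scores, mask)
penalty = sparsity_penalty(mask, weight=0.1)
```

With uniform scores, each token's attention concentrates on the positions whose mask values are close to 1, while the penalty term gives training a reason to keep most mask entries small.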
Abstract
The canonical approach to video captioning dictates a caption generation model to learn from offline-extracted dense video features. These feature extractors usually operate on video frames sampled at a fixed frame rate and are often trained on image/video understanding tasks, without adaption to video captioning data. In this work, we present SwinBERT, an end-to-end transformer-based model for video captioning, which takes video frame patches directly as inputs, and outputs a natural language description. Instead of leveraging multiple 2D/3D feature extractors, our method adopts a video transformer to encode spatial-temporal representations that can adapt to variable lengths of video input without dedicated design for different frame rates. Based on this model architecture, we show that video captioning can benefit significantly from more densely sampled video frames as opposed to…
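The contrast between sparse and dense frame sampling mentioned in the abstract can be sketched as follows. This is a hypothetical helper, not code from the paper: `sample_frame_indices` and the clip length and sample counts used below are illustrative assumptions.

```python
def sample_frame_indices(total_frames, num_samples):
    """Pick num_samples frame indices spread evenly across the clip.

    Each index is taken from the middle of an equal-width segment.
    """
    if num_samples <= 0 or total_frames <= 0:
        return []
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]

# A hypothetical 300-frame clip sampled sparsely and densely.
sparse = sample_frame_indices(300, 4)
dense = sample_frame_indices(300, 32)
```

Both calls cover the same clip; the dense setting simply hands the video encoder more frames, and therefore more tokens, to attend over.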
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
