Collaborative Three-Stream Transformers for Video Captioning

Hao Wang; Libo Zhang; Heng Fan; Tiejian Luo

arXiv:2309.09611·cs.CV·September 19, 2023

TL;DR

This paper introduces COST, a three-stream transformer framework that models subject, predicate, and object separately to improve video captioning by capturing multi-granular visual-linguistic interactions.

Contribution

The paper proposes a novel three-branch transformer architecture with cross-granularity attention for enhanced video captioning.

Findings

01 Outperforms state-of-the-art methods on the YouCookII, ActivityNet Captions, and MSVD datasets.

02 Effectively models multi-granular interactions between video components and text.

03 Is trained end to end and generalizes well across the three datasets.

Abstract

As the most critical components in a sentence, subject, predicate and object require special attention in the video captioning task. To implement this idea, we design a novel framework, named COllaborative three-Stream Transformers (COST), to model the three parts separately and complement each other for better representation. Specifically, COST is formed by three branches of transformers to exploit the visual-linguistic interactions of different granularities in spatial-temporal domain between videos and text, detected objects and text, and actions and text. Meanwhile, we propose a cross-granularity attention module to align the interactions modeled by the three branches of transformers, then the three branches of transformers can support each other to exploit the most discriminative semantic information of different granularities for accurate predictions of captions. The whole model…
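The abstract does not give the formulation of the cross-granularity attention module, so the following is only a minimal sketch of the general idea it describes: three feature streams (video, detected objects, actions) that attend to one another so each can draw on the other two. Everything here is an assumption made for illustration, not the authors' implementation: the function names `cross_attention` and `cross_granularity_step`, the plain scaled dot-product attention without learned projections, the residual update, and the token counts and feature width.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, context):
    """Scaled dot-product attention: query (n, d) attends to context (m, d).

    Hypothetical helper: no learned projections, single head.
    """
    d = query.shape[-1]
    weights = softmax(query @ context.T / np.sqrt(d), axis=-1)  # (n, m)
    return weights @ context, weights


def cross_granularity_step(streams):
    """Let each stream attend to the tokens of the other two streams.

    streams: dict mapping a stream name to an array of shape (tokens, d).
    Returns a dict with the same keys and shapes. The residual addition is
    an assumption of this sketch, not something the abstract states.
    """
    updated = {}
    for name, feats in streams.items():
        others = np.concatenate(
            [f for key, f in streams.items() if key != name], axis=0
        )
        attended, _ = cross_attention(feats, others)
        updated[name] = feats + attended
    return updated


# Toy features; token counts and width are arbitrary placeholders.
rng = np.random.default_rng(0)
streams = {
    "video": rng.normal(size=(8, 16)),
    "object": rng.normal(size=(5, 16)),
    "action": rng.normal(size=(3, 16)),
}
aligned = cross_granularity_step(streams)
for name, feats in aligned.items():
    print(name, feats.shape)
```

Each stream keeps its own token count and feature width after the step, so in a fuller model the three branches could continue as separate transformers while still exchanging information. The sketch covers only the exchange between streams, not the interaction with text or the caption decoding.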
