Text with Knowledge Graph Augmented Transformer for Video Captioning

Xin Gu; Guang Chen; Yufei Wang; Libo Zhang; Tiejian Luo; Longyin Wen

arXiv:2303.12423 · cs.CV · March 28, 2023 · 6 cites

Open Access

TL;DR

This paper introduces TextKG, a two-stream transformer model augmented with knowledge graphs for improved video captioning, effectively addressing the long-tail words challenge and outperforming state-of-the-art methods on multiple datasets.

Contribution

The paper proposes a novel two-stream transformer architecture with external knowledge integration and cross attention for enhanced video captioning.

Findings

01 Outperforms state-of-the-art on four datasets

02 Improves CIDEr scores by 18.7% on YouCookII

03 Effectively mitigates long-tail words challenge

Abstract

Video captioning aims to describe the content of videos using natural language. Although significant progress has been made, there is still much room to improve the performance for real-world applications, mainly due to the long-tail words challenge. In this paper, we propose a text with knowledge graph augmented transformer (TextKG) for video captioning. Notably, TextKG is a two-stream transformer, formed by the external stream and internal stream. The external stream is designed to absorb additional knowledge, which models the interactions between the additional knowledge, e.g., pre-built knowledge graph, and the built-in information of videos, e.g., the salient object regions, speech transcripts, and video captions, to mitigate the long-tail words challenge. Meanwhile, the internal stream is designed to exploit the multi-modality information in videos (e.g., the appearance of video…
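The abstract describes two streams that exchange information, and the contribution summary names cross attention as the mechanism. As a rough illustration only, the sketch below shows plain scaled dot-product cross attention between two small token sets in pure Python. It is not the authors' implementation: the function names, the toy embeddings, the absence of learned projection matrices, and the choice to let each stream query the other are all assumptions made for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query row attends over all key rows.

    No learned projections are applied here (an assumption made to keep
    the sketch short); a real transformer layer would project the inputs first.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value rows, one output row per query.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Hypothetical toy embeddings, not taken from the paper.
internal_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # stands in for video/speech features
external_tokens = [[0.5, 0.5], [2.0, 0.0]]               # stands in for knowledge-graph entries

# Assumed direction of exchange: each stream queries the other.
internal_updated = cross_attention(internal_tokens, external_tokens, external_tokens)
external_updated = cross_attention(external_tokens, internal_tokens, internal_tokens)
```

Each output row is a convex combination of the other stream's rows, so the result keeps one row per query token while drawing its content from the other stream.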

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition