Text with Knowledge Graph Augmented Transformer for Video Captioning
Xin Gu, Guang Chen, Yufei Wang, Libo Zhang, Tiejian Luo, Longyin Wen

TL;DR
This paper introduces TextKG, a two-stream transformer model augmented with knowledge graphs for improved video captioning, effectively addressing the long-tail words challenge and outperforming state-of-the-art methods on multiple datasets.
Contribution
The paper proposes a novel two-stream transformer architecture with external knowledge integration and cross attention for enhanced video captioning.
Findings
Outperforms state-of-the-art methods on four datasets
Improves CIDEr scores by 18.7% on YouCookII
Mitigates the long-tail words challenge
Abstract
Video captioning aims to describe the content of videos using natural language. Although significant progress has been made, there is still much room to improve the performance for real-world applications, mainly due to the long-tail words challenge. In this paper, we propose a text with knowledge graph augmented transformer (TextKG) for video captioning. Notably, TextKG is a two-stream transformer, formed by the external stream and the internal stream. The external stream is designed to absorb additional knowledge: it models the interactions between the additional knowledge (e.g., a pre-built knowledge graph) and the built-in information of videos (e.g., the salient object regions, speech transcripts, and video captions) to mitigate the long-tail words challenge. Meanwhile, the internal stream is designed to exploit the multi-modality information in videos (e.g., the appearance of video…
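To make the two-stream idea with cross attention more concrete, here is a minimal NumPy sketch of one stream attending to the other. It is an illustration only, not the authors' implementation: the function names, token counts, feature size, the direction of attention (internal tokens querying external tokens), and the omission of learned projection matrices are all assumptions made for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # Scaled dot-product attention where one stream (queries) reads from
    # the other stream (context). Learned Q/K/V projections are omitted;
    # this is a simplification, not the paper's layer.
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ context, weights

rng = np.random.default_rng(0)
# Hypothetical token sets: 6 internal-stream tokens (e.g. video features)
# and 4 external-stream tokens (e.g. knowledge graph entries), 16-dim each.
internal_tokens = rng.normal(size=(6, 16))
external_tokens = rng.normal(size=(4, 16))

fused, weights = cross_attention(internal_tokens, external_tokens)
# Residual connection: internal tokens enriched with external information.
updated_internal = internal_tokens + fused
```

Each row of `weights` is a distribution over the external tokens, so every internal token receives a weighted mix of external information. A real model would add learned projections, multiple heads, and layer normalization.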
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
