Relational Graph Learning for Grounded Video Description Generation
Wenqiao Zhang, Xin Eric Wang, Siliang Tang, Haizhou Shi, Haocheng Shi, Jun Xiao, Yueting Zhuang, William Yang Wang

TL;DR
This paper introduces a relational graph learning framework for grounded video description that enhances fine-grained visual concept understanding and reduces object hallucination by leveraging scene graph representations.
Contribution
It proposes a novel relational graph learning approach with language-refined scene graphs to improve grounded video captioning accuracy and detail.
Findings
Improved accuracy in fine-grained video descriptions.
Reduction in object hallucination in generated captions.
Enhanced grounding of relational words in video regions.
Abstract
Grounded video description (GVD) encourages captioning models to attend to appropriate video regions (e.g., objects) dynamically and generate a description. Such a setting can help explain the decisions of captioning models and prevent the model from hallucinating object words in its description. However, such a design mainly focuses on object word generation and thus may ignore fine-grained information and suffer from missing visual concepts. Moreover, relational words (e.g., "jump left or right") are usually the result of spatio-temporal inference, i.e., these words cannot be grounded on certain spatial regions. To tackle the above limitations, we design a novel relational graph learning framework for GVD, in which a language-refined scene graph representation is designed to explore fine-grained visual concepts. Furthermore, the refined graph can be regarded as relational inductive knowledge…
