A Multi-level Alignment Training Scheme for Video-and-Language Grounding
Yubo Zhang, Feiyang Niu, Qing Ping, Govind Thattai

TL;DR
This paper introduces a multi-level alignment training scheme that enhances video-and-language grounding by capturing semantic relations at global and segment levels, improving encoding similarity and task performance.
Contribution
The work proposes a novel multi-level alignment training scheme using contrastive loss to better encode semantic relations in shared feature space for video-language grounding.
Findings
Achieved comparable performance to state-of-the-art on multiple datasets.
Effectively captures semantic relations at multiple levels.
Improves encoding similarity between related video and language pairs.
Abstract
To solve video-and-language grounding tasks, the key is for the network to understand the connection between the two modalities. For a pair of video and language description, their semantic relation is reflected by their encodings' similarity. A good multi-modality encoder should be able to well capture both inputs' semantics and encode them in the shared feature space where embedding distance gets properly translated into their semantic similarity. In this work, we focused on this semantic connection between video and language, and developed a multi-level alignment training scheme to directly shape the encoding process. Global and segment levels of video-language alignment pairs were designed, based on the information similarity ranging from high-level context to fine-grained semantics. The contrastive loss was used to contrast the encodings' similarities between the positive and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques
