Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering

Ting Yu; Kunhao Fu; Jian Zhang; Qingming Huang; Jun Yu

arXiv:2410.09379·cs.CV·October 15, 2024

TL;DR

This paper introduces an end-to-end multi-granularity contrastive cross-modal collaborative generation model for long-term VideoQA, improving semantic understanding and answer generation by integrating unimodal modeling, contrastive learning, and generative reformulation.

Contribution

The paper proposes an end-to-end framework for long-term VideoQA that combines joint unimodal modeling, contrastive learning, and generative reformulation. It targets two weaknesses that the abstract attributes to video-language pre-training models: limited domain-specific reasoning and a mismatch in task formulation.

Findings

01 Outperforms existing methods on six VideoQA datasets.

02 Effectively captures semantic correspondences through contrastive learning.

03 Reformulates VideoQA as a generative task for better reasoning.
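
To make the contrastive-learning finding concrete, here is a minimal sketch of a generic symmetric InfoNCE loss over paired video and question embeddings. It is not the paper's multi-granularity objective, which this page does not spell out; the function names, the temperature of 0.07, and the toy embeddings are all assumptions chosen for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch where video_embs[i] pairs with text_embs[i].

    Hypothetical helper for illustration only; the temperature is an assumed value.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)

    # sim[i][j]: temperature-scaled cosine similarity between video i and text j.
    sim = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def diagonal_cross_entropy(logits):
        # Cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(logits):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(logits)

    sim_t = [list(col) for col in zip(*sim)]  # text-to-video direction
    return 0.5 * (diagonal_cross_entropy(sim) + diagonal_cross_entropy(sim_t))


# Toy batch: matched pairs should score a much lower loss than swapped pairs.
video = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(video, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(video, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is near zero when each video embedding matches its own question embedding and large when the pairs are swapped, which is the behavior a contrastive objective relies on to pull corresponding video and text representations together.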

Abstract

Long-term Video Question Answering (VideoQA) is a challenging vision-and-language bridging task focusing on semantic understanding of untrimmed long-term videos and diverse free-form questions, simultaneously emphasizing comprehensive cross-modal reasoning to yield precise answers. The canonical approaches typically rely on off-the-shelf feature extractors to sidestep the expensive computation overhead, but this often results in domain-independent, modality-unrelated representations. Furthermore, the inherent gradient blocking between unimodal comprehension and cross-modal interaction hinders reliable answer generation. In contrast, recently emerging video-language pre-training models enable cost-effective end-to-end modeling but fall short in domain-specific reasoning and exhibit disparities in task formulation. Toward this end, we present an entirely end-to-end solution for long-term…

Code & Models

Repositories

openmicg/mcg (PyTorch, Official)

Taxonomy

Methods: Contrastive Learning