Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering

Ting Yu; Kunhao Fu; Shuhui Wang; Qingming Huang; Jun Yu

arXiv:2410.09380 · cs.CV · October 15, 2024

TL;DR

This paper introduces HeurVidQA, a framework that enhances video-language models for VideoQA by incorporating domain-specific heuristics, leading to improved reasoning and accuracy across multiple datasets.

Contribution

The paper presents a novel approach that uses domain-specific entity-action heuristics to refine pre-trained video-language models for better VideoQA performance.

Findings

01. Significant performance improvements on multiple VideoQA datasets
02. Effective use of domain-specific heuristics to guide model reasoning
03. Enhanced focus on key entities and actions improves accuracy

Abstract

Video Question Answering (VideoQA) represents a crucial intersection between video understanding and language processing, requiring both discriminative unimodal comprehension and sophisticated cross-modal interaction for accurate inference. Despite advancements in multi-modal pre-trained models and video-language foundation models, these systems often struggle with domain-specific VideoQA due to their generalized pre-training objectives. Addressing this gap necessitates bridging the divide between broad cross-modal knowledge and the specific inference demands of VideoQA tasks. To this end, we introduce HeurVidQA, a framework that leverages domain-specific entity-action heuristics to refine pre-trained video-language foundation models. Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward…
