Uncertainty-Guided Self-Questioning and Answering for Video-Language Alignment

Jin Chen; Kaijing Ma; Haojian Huang; Han Fang; Hao Sun; Mehdi Hosseinzadeh; Zhe Liu

arXiv:2410.02768·cs.CV·May 7, 2025

Open Access

TL;DR

This paper introduces BoViLA, a self-training framework that uses LLM-based self-questioning and answering to improve video-language alignment in VideoQA tasks, with uncertainty-based filtering to screen out low-quality self-generated questions.

Contribution

It is the first to explore LLM-based self-training for modality alignment, enhancing VideoQA performance through uncertainty-aware question filtering.

Findings

01. Outperforms state-of-the-art methods on five VideoQA benchmarks.

02. Uses evidential deep learning (EDL) effectively to filter out low-quality self-generated questions.

03. Demonstrates the generality of the self-training framework.
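
As a rough illustration of the uncertainty filtering mentioned in finding 02, the following is a minimal Python sketch based on the standard subjective-logic form of evidential deep learning: non-negative evidence e_k over K classes gives Dirichlet parameters alpha_k = e_k + 1, and the uncertainty is u = K / sum(alpha). This page does not state the paper's exact formulation, so everything here is an assumption made for illustration: the function names `edl_uncertainty` and `filter_questions`, the threshold of 0.5, and the example evidence vectors are all hypothetical.

```python
from typing import List, Sequence, Tuple


def edl_uncertainty(evidence: Sequence[float]) -> float:
    """Return the uncertainty mass u = K / S of a Dirichlet opinion.

    Assumes the common EDL convention alpha_k = evidence_k + 1, so that
    zero evidence for every class yields maximal uncertainty u = 1.
    """
    if len(evidence) == 0:
        raise ValueError("evidence must contain at least one class")
    if any(e < 0 for e in evidence):
        raise ValueError("evidence values must be non-negative")
    num_classes = len(evidence)
    # Dirichlet strength S is the sum of all alpha_k = e_k + 1.
    strength = sum(e + 1.0 for e in evidence)
    return num_classes / strength


def filter_questions(
    candidates: Sequence[Tuple[str, Sequence[float]]],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[str]:
    """Keep only the questions whose uncertainty is at or below the threshold."""
    return [
        question
        for question, evidence in candidates
        if edl_uncertainty(evidence) <= threshold
    ]


if __name__ == "__main__":
    # Hypothetical self-generated questions with made-up evidence vectors.
    generated = [
        ("What is the person holding?", [9.0, 0.0, 0.0, 0.0]),
        ("Why is the video?", [0.1, 0.1, 0.1, 0.1]),
    ]
    for question, evidence in generated:
        print(question, round(edl_uncertainty(evidence), 3))
    print(filter_questions(generated))
```

A hard threshold is only one way to use the uncertainty score; a training pipeline could instead down-weight uncertain questions, and which option the authors chose is not stated on this page.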

Abstract

Multi-modal models are advancing rapidly, and some demonstrate remarkable capabilities. However, annotating video-text pairs remains expensive, and the available annotations are insufficient. Taking video question answering (VideoQA) as an example, human-annotated questions and answers often cover only part of the video because the corresponding text is often short and monotonous, which leaves much of the video underutilized. To address this, we propose a Bootstrapping Video-Language Alignment framework (BoViLA), a self-training method that augments question samples during training through LLM-based self-questioning and answering, helping the model exploit video information and the internal knowledge of LLMs more thoroughly to improve modality alignment. However, low-quality self-generated questions may instead degrade performance, especially in the early stages of training, as we…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques