First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge

Yingzhe Peng; Yixiao Yuan; Zitian Ao; Huapeng Zhou; Kangqi Wang; Qipeng Zhu; Xu Yang

arXiv:2409.13538·cs.CV·September 23, 2024

TL;DR

This paper details the first-place solution to the Multiple-choice Video QA track of The Second Perception Test Challenge: a fine-tuned QwenVL2 (7B) model combined with model ensemble strategies and Test Time Augmentation, reaching a Top-1 Accuracy of 0.7647 on the leaderboard.

Contribution

The paper presents an approach that combines fine-tuning QwenVL2 (7B) on the provided training set with model ensemble strategies and Test Time Augmentation, taking first place in the Multiple-choice Video QA track.

Findings

01 Achieved Top-1 Accuracy of 0.7647

02 Effective use of model ensemble and augmentation

03 Demonstrated strong performance on video QA

Abstract

In this report, we present our first-place solution to the Multiple-choice Video Question Answering (QA) track of The Second Perception Test Challenge. This competition posed a complex video understanding task, requiring models to accurately comprehend and answer questions about video content. To address this challenge, we leveraged the powerful QwenVL2 (7B) model and fine-tuned it on the provided training set. Additionally, we employed model ensemble strategies and Test Time Augmentation to boost performance. Through continuous optimization, our approach achieved a Top-1 Accuracy of 0.7647 on the leaderboard.
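
The abstract does not say how the outputs of the ensemble members and the Test Time Augmentation passes are combined. As a minimal sketch, assuming simple majority voting over the predicted option index for each question, the combination step and the Top-1 Accuracy metric could look like the following. The function names, the question ids, and the dictionary layout are hypothetical and chosen only for illustration; they are not taken from the report.

```python
from collections import Counter
from typing import Dict, List, Sequence


def majority_vote(votes: Sequence[int]) -> int:
    """Return the most frequent option index; ties go to the earliest vote."""
    if not votes:
        raise ValueError("votes must be non-empty")
    counts = Counter(votes)
    best = max(counts.values())
    # Scan in the original order so that ties resolve deterministically.
    for vote in votes:
        if counts[vote] == best:
            return vote
    raise AssertionError("unreachable: some vote always has the max count")


def ensemble_predictions(runs: List[Dict[str, int]]) -> Dict[str, int]:
    """Combine per-question predictions from several runs.

    Each run maps a question id to a predicted option index. A run may be
    a different fine-tuned checkpoint or an augmented pass of one model
    (assumption: both kinds of run are treated as equal voters).
    """
    question_ids = runs[0].keys()
    return {qid: majority_vote([run[qid] for run in runs]) for qid in question_ids}


def top1_accuracy(pred: Dict[str, int], gold: Dict[str, int]) -> float:
    """Fraction of questions whose predicted option matches the answer."""
    correct = sum(1 for qid, answer in gold.items() if pred.get(qid) == answer)
    return correct / len(gold)


# Hypothetical toy data: three runs over two questions.
runs = [
    {"q1": 0, "q2": 2},
    {"q1": 0, "q2": 1},
    {"q1": 1, "q2": 1},
]
combined = ensemble_predictions(runs)
```

Majority voting is used here only because it is the simplest way to merge discrete multiple-choice answers; averaging per-option scores across runs would be an equally plausible alternative when the model exposes them.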

Taxonomy

Topics: Industrial Vision Systems and Defect Detection