First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge
Yingzhe Peng, Yixiao Yuan, Zitian Ao, Huapeng Zhou, Kangqi Wang, Qipeng Zhu, Xu Yang

TL;DR
This report details the first-place solution to the Multiple-choice Video QA track of The Second Perception Test Challenge. It combines a fine-tuned QwenVL2 (7B) model, model ensembling, and Test Time Augmentation to reach a Top-1 Accuracy of 0.7647 on the leaderboard.
Contribution
The report describes an approach that combines fine-tuning on the provided training set, model ensemble strategies, and Test Time Augmentation for multiple-choice video QA, and that placed first in the challenge track.
Findings
Achieved Top-1 Accuracy of 0.7647
Effective use of model ensemble and augmentation
Demonstrated strong performance on video QA
Abstract
In this report, we present our first-place solution to the Multiple-choice Video Question Answering (QA) track of The Second Perception Test Challenge. This competition posed a complex video understanding task, requiring models to accurately comprehend and answer questions about video content. To address this challenge, we leveraged the powerful QwenVL2 (7B) model and fine-tuned it on the provided training set. Additionally, we employed model ensemble strategies and Test Time Augmentation to boost performance. Through continuous optimization, our approach achieved a Top-1 Accuracy of 0.7647 on the leaderboard.
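The abstract does not say which ensemble rule or which augmentations were used, so the following is only a minimal sketch of one common way to combine the two ideas for multiple-choice QA, not the authors' implementation. It assumes that each model can be wrapped as a function returning one probability per answer option, that Test Time Augmentation means re-asking the question with the options rotated, and that the ensemble averages probabilities. The names `Scorer`, `tta_scores`, `ensemble_predict`, and the two toy scorers are hypothetical.

```python
from typing import Callable, List, Sequence

# Hypothetical scorer type: takes the answer options in some order and
# returns one probability per option, in that same order.
Scorer = Callable[[Sequence[str]], List[float]]


def tta_scores(scorer: Scorer, options: Sequence[str]) -> List[float]:
    """Average a scorer's probabilities over all cyclic rotations of the options."""
    n = len(options)
    totals = [0.0] * n
    for shift in range(n):
        # Position j of the rotated list holds original option (j + shift) % n.
        rotated = [options[(j + shift) % n] for j in range(n)]
        probs = scorer(rotated)
        for j, p in enumerate(probs):
            totals[(j + shift) % n] += p
    return [t / n for t in totals]


def ensemble_predict(scorers: Sequence[Scorer], options: Sequence[str]) -> int:
    """Average TTA scores across models; return the index of the best option."""
    n = len(options)
    avg = [0.0] * n
    for scorer in scorers:
        for i, s in enumerate(tta_scores(scorer, options)):
            avg[i] += s / len(scorers)
    return max(range(n), key=lambda i: avg[i])


# Toy stand-ins for real models (assumptions for illustration only).
def position_biased(options: Sequence[str]) -> List[float]:
    """Always prefers whichever option is listed first, ignoring content."""
    return [0.6 if i == 0 else 0.2 for i in range(len(options))]


def content_scorer(options: Sequence[str]) -> List[float]:
    """Prefers the option 'cat' regardless of where it appears."""
    return [0.8 if o == "cat" else 0.1 for o in options]
```

In this toy setup, rotating the options cancels the position bias of the first scorer, so the averaged ensemble follows the content-based scorer. A real pipeline would replace the toy scorers with calls to the fine-tuned models.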
Taxonomy
Topics: Industrial Vision Systems and Defect Detection
