Technical Report for CVPR 2022 LOVEU AQTC Challenge
Hyeonyu Kim, Jongeun Kim, Jeonghun Kang, Sanguk Park, Dongchan Park, and Taehwan Kim

TL;DR
This technical report describes the second-place model for the AQTC task in the CVPR 2022 LOVEU challenge, which introduces a new attention mechanism for multi-modal, multi-step video question answering.
Contribution
The report proposes a new context ground module attention mechanism, analyzes the effect of the number of buttons, and ablates different step networks and video features for multi-modal video question answering.
Findings
Achieved 2nd place overall in LOVEU challenge track 3
Secured 1st place in two of the four evaluation metrics
Demonstrated effectiveness of the proposed attention mechanism
Abstract
This technical report presents the second-place model for AQTC, a task newly introduced in the CVPR 2022 LOng-form VidEo Understanding (LOVEU) challenge. The task is difficult because answers span multiple steps, inputs are multi-modal, and button representations in video are diverse and changing. We address these difficulties by proposing a new context ground module attention mechanism for more effective feature mapping. We also analyze the effect of the number of buttons and run ablation studies over different step networks and video features. As a result, we achieved 2nd place overall in LOVEU competition track 3, and 1st place in two of the four evaluation metrics. Our code is available at https://github.com/jaykim9870/CVPR-22_LOVEU_unipyler.
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
