A Solution to CVPR'2023 AQTC Challenge: Video Alignment for Multi-Step Inference

Chao Zhang; Shiwei Wu; Sirui Zhao; Tong Xu; Enhong Chen

arXiv:2306.14412·cs.CV·June 27, 2023

TL;DR

This paper presents a novel video alignment method for multi-step inference in egocentric instructional videos, significantly improving AI assistant guidance and achieving second place in the CVPR 2023 AQTC challenge.

Contribution

It introduces an enhanced video alignment approach using VideoCLIP, question grounding, feature reweighting, and GRU-based inference for better task completion.

Findings

01. Secured 2nd place in the CVPR'2023 AQTC challenge.

02. Demonstrated superior performance over existing methods.

03. Effective multi-step inference in instructional videos.

Abstract

Affordance-centric Question-driven Task Completion (AQTC) for Egocentric Assistant introduces a groundbreaking scenario. In this scenario, through learning instructional videos, AI assistants provide users with step-by-step guidance on operating devices. In this paper, we present a solution for enhancing video alignment to improve multi-step inference. Specifically, we first utilize VideoCLIP to generate video-script alignment features. Afterwards, we ground the question-relevant content in instructional videos. Then, we reweight the multimodal context to emphasize prominent features. Finally, we adopt a GRU to conduct multi-step inference. Through comprehensive experiments, we demonstrate the effectiveness and superiority of our method, which secured 2nd place in the CVPR'2023 AQTC challenge. Our code is available at https://github.com/zcfinal/LOVEU-CVPR23-AQTC.
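To make the last three stages of the abstract concrete, here is a minimal sketch in pure Python. It is not the authors' implementation (their PyTorch code is in the linked repository). It assumes feature vectors have already been extracted (for example by VideoCLIP), and every function name, the temperature, the magnitude-based reweighting rule, and the `readout` scoring vector are hypothetical choices made only for illustration. The GRU step follows the standard update-gate and reset-gate formulation with biases omitted.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def softmax(scores, temperature=1.0):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ground_question(question, segments, temperature=0.1):
    """Pool segment features, weighted by cosine similarity to the question."""
    weights = softmax([cosine(question, s) for s in segments], temperature)
    pooled = [sum(w * s[i] for w, s in zip(weights, segments))
              for i in range(len(question))]
    return pooled, weights


def reweight(features):
    """Hypothetical reweighting: amplify dimensions with larger magnitude."""
    gates = softmax([abs(f) for f in features])
    return [len(features) * g * f for g, f in zip(gates, features)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]


def init_gru(input_dim, hidden_dim, seed=0):
    """Random, untrained GRU weights (illustration only; biases omitted)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {name: mat(hidden_dim, input_dim if name[0] == "W" else hidden_dim)
            for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}


def gru_step(x, h, p):
    """One standard GRU update: update gate z, reset gate r, candidate state."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    gated = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a + b)
            for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], gated))]
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def infer_steps(context, steps, params, readout):
    """Pick one candidate per step; the chosen state is carried to the next step."""
    hidden = gru_step(context, [0.0] * len(readout), params)
    choices = []
    for candidates in steps:
        states = [gru_step(c, hidden, params) for c in candidates]
        scores = [dot(readout, s) for s in states]
        best = max(range(len(scores)), key=scores.__getitem__)
        choices.append(best)
        hidden = states[best]
    return choices


# Toy data: 3-dimensional features, two steps with two candidate actions each.
question = [1.0, 0.0, 0.0]
segments = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
grounded, weights = ground_question(question, segments)
context = reweight(grounded)
params = init_gru(input_dim=3, hidden_dim=4)
readout = [0.5, -0.5, 0.5, -0.5]
steps = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
         [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]]
choices = infer_steps(context, steps, params, readout)
print(choices)
```

Because the weights are random and untrained, the printed choices carry no meaning. The point is the data flow: question-conditioned pooling, a reweighted context vector, and a recurrent state that lets each step's decision depend on the steps chosen before it.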

Code & Models

Repositories

zcfinal/loveu-cvpr23-aqtc
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Human Pose and Action Recognition

Methods: Gated Recurrent Unit