AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant

Stan Weixian Lei; Difei Gao; Yuxuan Wang; Dongxing Mao; Zihan Liang; Lingmin Ran; Mike Zheng Shou

arXiv:2111.15050·cs.CV·October 11, 2022

TL;DR

AssistSR introduces a new task and dataset for task-oriented video segment retrieval based on multimodal queries, aiming to enhance personal AI assistants in understanding and retrieving instructional video segments.

Contribution

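The paper proposes the Task-oriented Question-driven Video Segment Retrieval (TQVSR) task, constructs the AssistSR dataset to support it, and develops the Dual Multimodal Encoders (DME) model, advancing multimodal video retrieval for personal AI applications.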

Findings

01. DME significantly outperforms baseline methods

02. AssistSR dataset contains 3.2k questions on 1.6k video segments

03. Detailed ablation studies validate the model's effectiveness

Abstract

It is still a pipe dream that personal AI assistants on the phone and AR glasses can assist our daily life in addressing our questions like "how to adjust the date for this watch?" and "how to set its heating duration?" (while pointing at an oven). The queries used in conventional tasks (i.e. Video Question Answering, Video Retrieval, Moment Localization) are often factoid and based on pure text. In contrast, we present a new task called Task-oriented Question-driven Video Segment Retrieval (TQVSR). Each of our questions is an image-box-text query that focuses on affordance of items in our daily life and expects relevant answer segments to be retrieved from a corpus of instructional video-transcript segments. To support the study of this TQVSR task, we construct a new dataset called AssistSR. We design novel guidelines to create high-quality samples. This dataset contains 3.2k…
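To make the task setup concrete, here is a minimal Python sketch of what an image-box-text query and a corpus of video-transcript segments might look like as data. All names (`Query`, `Segment`, `rank_segments`), the box format, and the file names are hypothetical and do not come from the paper or its repositories. The ranking function is a plain word-overlap baseline over the question text and transcripts only; it ignores the image and box entirely and is not the DME model.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Query:
    """Hypothetical image-box-text query."""
    image_path: str                          # frame the user is looking at
    box: Tuple[float, float, float, float]   # assumed (x_min, y_min, x_max, y_max)
    text: str                                # the question itself


@dataclass(frozen=True)
class Segment:
    """Hypothetical instructional video-transcript segment."""
    video_id: str
    start_s: float
    end_s: float
    transcript: str


def _tokens(s: str) -> Set[str]:
    # Lowercase, strip simple punctuation, split on whitespace.
    return set(s.lower().replace("?", " ").replace(".", " ").split())


def rank_segments(query: Query, corpus: List[Segment]) -> List[Segment]:
    """Rank segments by Jaccard overlap between question and transcript words.

    Text-only toy baseline: the image and box fields are not used here.
    """
    q = _tokens(query.text)

    def score(seg: Segment) -> float:
        t = _tokens(seg.transcript)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return sorted(corpus, key=score, reverse=True)


corpus = [
    Segment("oven_demo", 30.0, 55.0,
            "press the timer button to set the heating duration"),
    Segment("watch_demo", 12.0, 40.0,
            "pull the crown out to adjust the date on the watch"),
]
query = Query("frame_0001.jpg", (0.2, 0.3, 0.6, 0.8),
              "how to adjust the date for this watch?")
ranked = rank_segments(query, corpus)
```

A text-only baseline like this cannot resolve questions such as "how to set its heating duration?", where the referent is identified only by the pointed-at region; that gap is what motivates including the image and box in the query.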

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning