Artemis: Towards Referential Understanding in Complex Videos

Jihao Qiu; Yuan Zhang; Xi Tang; Lingxi Xie; Tianren Ma; Pengyu Yan; David Doermann; Qixiang Ye; Yunjie Tian

arXiv:2406.00258·cs.CV·June 4, 2024

TL;DR

Artemis is a multimodal large language model designed to improve referential understanding in complex videos by accurately identifying and describing targets based on natural-language questions and bounding boxes.

Contribution

We introduce Artemis, an MLLM that improves video referential understanding by extracting target-specific video features. It is trained on the new VideoRef45K dataset using a three-stage training procedure.

Findings

01. Achieves promising quantitative and qualitative results.

02. Successfully integrates with video grounding and summarization tools.

03. Demonstrates improved referential understanding in complex videos.

Abstract

Videos carry rich visual information, including object descriptions, actions, and interactions, but existing multimodal large language models (MLLMs) fall short in referential understanding scenarios such as video-based referring. In this paper, we present Artemis, an MLLM that pushes video-based referential understanding to a finer level. Given a video, Artemis receives a natural-language question with a bounding box in any video frame and describes the referred target across the entire video. The key to achieving this goal lies in extracting compact, target-specific video features, where we set a solid baseline by tracking and selecting spatiotemporal features from the video. We train Artemis on the newly established VideoRef45K dataset with 45K video-QA pairs and design a computationally efficient, three-stage training procedure. Results are promising both quantitatively and…
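The "tracking and selecting" idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that per-frame boxes for the target already come from some external tracker, that each frame is represented by a small grid of feature vectors, and it swaps in plain average pooling plus greedy farthest-point selection as one plausible way to keep a compact set of target features. The names `pool_box`, `select_diverse`, and `target_features` are hypothetical.

```python
from math import dist


def pool_box(feature_map, box):
    """Average the feature vectors of the grid cells covered by one box.

    feature_map: H x W grid (list of rows) of equal-length feature vectors.
    box: (x0, y0, x1, y1) in grid-cell coordinates, end-exclusive.
    """
    x0, y0, x1, y1 = box
    cells = [feature_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    if not cells:
        raise ValueError("box covers no grid cells")
    channels = len(cells[0])
    return [sum(c[i] for c in cells) / len(cells) for i in range(channels)]


def select_diverse(features, k):
    """Greedily pick up to k indices whose features are far apart.

    Assumption: farthest-point selection stands in for whatever
    selection rule the paper actually uses.
    """
    if not features or k <= 0:
        return []
    chosen = [0]  # always keep the first frame's feature
    while len(chosen) < min(k, len(features)):
        def gap(i):
            # Distance from candidate i to its nearest already-chosen feature.
            return min(dist(features[i], features[j]) for j in chosen)

        remaining = [i for i in range(len(features)) if i not in chosen]
        chosen.append(max(remaining, key=gap))
    return sorted(chosen)  # restore temporal order


def target_features(frame_maps, tracked_boxes, k):
    """Pool one feature per tracked box, then keep k diverse ones."""
    pooled = [pool_box(m, b) for m, b in zip(frame_maps, tracked_boxes)]
    keep = select_diverse(pooled, k)
    return keep, [pooled[i] for i in keep]


def constant_map(vector, height=2, width=2):
    """Toy helper: a feature grid where every cell holds the same vector."""
    return [[list(vector) for _ in range(width)] for _ in range(height)]


# Toy input: four frames, with the target's appearance changing after frame 1.
frames = [constant_map(v) for v in ([0, 0], [1, 0], [5, 5], [5, 6])]
boxes = [(0, 0, 2, 2)] * 4
kept_indices, kept_features = target_features(frames, boxes, k=2)
```

In this toy run the two kept features come from the frames that differ most, which is the intuition behind keeping a compact but informative set of target tokens for the language model.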

Code & Models

Repositories

qiujihao19/artemis (PyTorch, official)

Videos

Artemis: Towards Referential Understanding in Complex Videos · SlidesLive

Taxonomy

Topics: Video Analysis and Summarization · Human Pose and Action Recognition · Image Retrieval and Classification Techniques

Methods: Sparse Evolutionary Training