Elysium: Exploring Object-level Perception in Videos via MLLM

Han Wang, Yanjie Wang, Yongjie Ye, Yuxiang Nie, and Can Huang

arXiv:2403.16558 · cs.CV · April 1, 2024 · 1 citation

TL;DR

This paper introduces ElysiumTrack-1M, a large-scale video dataset, and Elysium, an end-to-end MLLM approach for object-level perception in videos, addressing the challenges of inter-frame understanding and computational efficiency.

Contribution

We present ElysiumTrack-1M, a new dataset for video object perception, and propose T-Selector, a token compression model for efficient multi-frame processing in MLLMs.
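To make the idea of token compression concrete, here is a minimal sketch that reduces the number of tokens representing one frame by average pooling. It is a generic stand-in chosen for illustration, not the T-Selector architecture, whose internals are not described on this page. The function name `compress_tokens`, the `keep` parameter, and the toy feature vectors are all hypothetical.

```python
from typing import List

Token = List[float]  # one visual token as a plain feature vector (assumption)


def compress_tokens(tokens: List[Token], keep: int) -> List[Token]:
    """Average-pool a frame's tokens down to `keep` tokens.

    Generic illustration of token compression; NOT the T-Selector design.
    """
    if keep <= 0:
        raise ValueError("keep must be positive")
    if not tokens or keep >= len(tokens):
        # Nothing to compress: return a copy of the input tokens.
        return [list(t) for t in tokens]

    n = len(tokens)
    dim = len(tokens[0])
    compressed: List[Token] = []
    for g in range(keep):
        # Split the n tokens into `keep` contiguous, non-empty groups.
        start = g * n // keep
        end = (g + 1) * n // keep
        group = tokens[start:end]
        # Replace each group with its element-wise mean.
        compressed.append(
            [sum(t[d] for t in group) / len(group) for d in range(dim)]
        )
    return compressed


# Toy frame: 8 tokens with 2 features each (made-up values).
frame_tokens = [[float(i), float(2 * i)] for i in range(8)]
pooled = compress_tokens(frame_tokens, keep=2)
print(len(frame_tokens), "->", len(pooled))
```

Pooling is only one possible compression strategy; a learned module can decide which information to keep rather than averaging uniformly.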

Findings

01. ElysiumTrack-1M contains 1.27 million annotated frames.

02. T-Selector compresses tokens so that MLLMs can process multiple frames more efficiently.

03. The end-to-end model performs object-level tasks without extra plug-ins.

Abstract

Multi-modal Large Language Models (MLLMs) have demonstrated their ability to perceive objects in still images, but their application in video-related tasks, such as object tracking, remains understudied. This lack of exploration is primarily due to two key challenges. Firstly, extensive pretraining on large-scale video datasets is required to equip MLLMs with the capability to perceive objects across multiple frames and understand inter-frame relationships. Secondly, processing a large number of frames within the context window of Large Language Models (LLMs) can impose a significant computational burden. To address the first challenge, we introduce ElysiumTrack-1M, a large-scale video dataset supported for three tasks: Single Object Tracking (SOT), Referring Single Object Tracking (RSOT), and Video Referring Expression Generation (Video-REG). ElysiumTrack-1M contains 1.27 million…
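The second challenge in the abstract, fitting many frames into an LLM context window, comes down to simple arithmetic. The sketch below works through it with illustrative numbers. The context window size, the tokens reserved for text, and the tokens-per-frame values are assumptions picked for the example, not figures reported for Elysium.

```python
def max_frames(context_window: int, reserved_text_tokens: int,
               tokens_per_frame: int) -> int:
    """Return how many frames fit once text tokens are set aside."""
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    budget = context_window - reserved_text_tokens
    if budget <= 0:
        return 0
    return budget // tokens_per_frame


# All values below are hypothetical, for illustration only.
CONTEXT_WINDOW = 4096        # assumed LLM context length in tokens
RESERVED_TEXT_TOKENS = 512   # assumed budget for the prompt and the answer

uncompressed = max_frames(CONTEXT_WINDOW, RESERVED_TEXT_TOKENS, 256)
compressed = max_frames(CONTEXT_WINDOW, RESERVED_TEXT_TOKENS, 32)
print(f"frames at 256 tokens each: {uncompressed}")
print(f"frames at 32 tokens each:  {compressed}")
```

Under these assumed numbers, cutting the per-frame token count by a factor of eight raises the number of frames that fit by the same factor, which is why per-frame token compression matters for tracking over long clips.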

Code & Models

Repositories

hon-wong/elysium
PyTorch · Official

Models

sty-yyj/elysium_7b
model · 78 downloads · 1 like

Datasets

sty-yyj/ElysiumTrack-1M
dataset · 128 downloads

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis