Elysium: Exploring Object-level Perception in Videos via MLLM
Han Wang, Yanjie Wang, Yongjie Ye, Yuxiang Nie, and Can Huang

TL;DR
This paper introduces Elysium, an end-to-end MLLM approach to object-level perception in videos, together with the large-scale video dataset ElysiumTrack-1M, addressing the challenges of inter-frame understanding and computational efficiency.
Contribution
We present ElysiumTrack-1M, a new dataset for video object perception, and propose T-Selector, a token compression model for efficient multi-frame processing in MLLMs.
Findings
ElysiumTrack-1M contains 1.27 million annotated frames.
T-Selector compresses tokens to make multi-frame processing in MLLMs more efficient.
The end-to-end model performs object-level tasks without extra plug-ins.
Abstract
Multi-modal Large Language Models (MLLMs) have demonstrated their ability to perceive objects in still images, but their application in video-related tasks, such as object tracking, remains understudied. This lack of exploration is primarily due to two key challenges. First, extensive pretraining on large-scale video datasets is required to equip MLLMs with the capability to perceive objects across multiple frames and understand inter-frame relationships. Second, processing a large number of frames within the context window of Large Language Models (LLMs) can impose a significant computational burden. To address the first challenge, we introduce ElysiumTrack-1M, a large-scale video dataset that supports three tasks: Single Object Tracking (SOT), Referring Single Object Tracking (RSOT), and Video Referring Expression Generation (Video-REG). ElysiumTrack-1M contains 1.27 million…
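The second challenge in the abstract, the cost of fitting many frames into an LLM context window, can be illustrated with simple arithmetic. The sketch below is not the paper's method or its T-Selector module. Every number in it (context window size, text budget, tokens per frame before and after compression) is a hypothetical value chosen only to show how per-frame token count limits the number of frames.

```python
def visual_tokens(num_frames: int, tokens_per_frame: int) -> int:
    """Total visual tokens the LLM must attend to for one clip."""
    return num_frames * tokens_per_frame


def max_frames(context_window: int, text_budget: int, tokens_per_frame: int) -> int:
    """Largest number of frames that fits after reserving room for text tokens."""
    return (context_window - text_budget) // tokens_per_frame


# Hypothetical values for illustration only; none are taken from the paper.
CONTEXT_WINDOW = 4096   # assumed LLM context length in tokens
TEXT_BUDGET = 512       # assumed tokens reserved for the prompt and the answer
UNCOMPRESSED = 256      # assumed visual tokens per frame without compression
COMPRESSED = 32         # assumed visual tokens per frame after compression

frames_uncompressed = max_frames(CONTEXT_WINDOW, TEXT_BUDGET, UNCOMPRESSED)
frames_compressed = max_frames(CONTEXT_WINDOW, TEXT_BUDGET, COMPRESSED)

print(f"frames without compression: {frames_uncompressed}")
print(f"frames with compression:    {frames_compressed}")
```

Under these assumed values, reducing the tokens per frame by a factor of eight raises the number of frames that fit by the same factor, which is the kind of trade-off a token compression module targets.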
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis
