Leum-VL Technical Report

Yuxuan He; Chaiming Huang; Yifan Wu; Hongjun Wang; Chenkui Shen; Jifan Zhang; Long Li

arXiv:2603.20354·cs.MM·March 24, 2026

TL;DR

This paper introduces SV6D, a six-dimensional structural framework for video analysis inspired by professional storyboarding, and presents Leum-VL-8B, a model trained to understand and utilize this structure for improved video understanding.

Contribution

The paper proposes a novel six-dimensional structural representation for videos and develops a new large-scale model trained to leverage this structure for better comprehension.

Findings

01. Leum-VL-8B achieves competitive scores on multiple video understanding benchmarks.

02. SV6D enables more accurate identification of timeline-grounded units in videos.

03. The framework improves downstream tasks like editing, retrieval, and recommendation.

Abstract

A short video succeeds not simply because of what it shows, but because of how it schedules attention -- yet current multimodal models lack the structural grammar to parse or produce this organization. Existing models can describe scenes, answer event-centric questions, and read on-screen text, but they are far less reliable at identifying timeline-grounded units such as hooks, cut rationales, shot-induced tension, and platform-facing packaging cues. Inspired by professional storyboard practice in film and television production, we propose SV6D (Structured Video in Six Dimensions), a representation framework that decomposes internet-native video into six complementary structural dimensions -- subject, aesthetics, camera language, editing, narrative, and dissemination -- with each label tied to physically observable evidence on the timeline. We formalize a unified optimization…
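To make the idea of timeline-grounded labels concrete, here is a minimal Python sketch of how an annotation in the six dimensions might be stored and queried. Only the six dimension names come from the abstract. The `TimelineLabel` record, its field names, and the helpers `labels_at` and `missing_dimensions` are hypothetical illustrations, not the paper's actual schema or API.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    # The six SV6D dimensions named in the abstract.
    SUBJECT = "subject"
    AESTHETICS = "aesthetics"
    CAMERA_LANGUAGE = "camera language"
    EDITING = "editing"
    NARRATIVE = "narrative"
    DISSEMINATION = "dissemination"


@dataclass(frozen=True)
class TimelineLabel:
    # Hypothetical record: field names are illustrative, not from the paper.
    dimension: Dimension
    label: str       # e.g. "hook" or "hard cut"
    start_s: float   # span start on the video timeline, in seconds
    end_s: float     # span end (exclusive), in seconds
    evidence: str    # the observable cue that justifies the label

    def __post_init__(self):
        # Every label must be tied to a non-empty span on the timeline.
        if not 0 <= self.start_s < self.end_s:
            raise ValueError("expected 0 <= start_s < end_s")


def labels_at(labels, t_s):
    """Return the labels whose [start_s, end_s) span contains time t_s."""
    return [lab for lab in labels if lab.start_s <= t_s < lab.end_s]


def missing_dimensions(labels):
    """Return the dimensions with no label yet, in declaration order."""
    covered = {lab.dimension for lab in labels}
    return [d for d in Dimension if d not in covered]


# Made-up example annotations for a short clip.
example = [
    TimelineLabel(Dimension.NARRATIVE, "hook", 0.0, 2.5,
                  "question posed in on-screen text"),
    TimelineLabel(Dimension.EDITING, "hard cut", 2.5, 2.6,
                  "abrupt shot change"),
]
```

With these made-up annotations, `labels_at(example, 1.0)` returns only the "hook" label, and `missing_dimensions(example)` lists the four dimensions that still lack a label. Requiring a time span and an evidence string on every record is one way to reflect the abstract's requirement that each label be tied to observable evidence on the timeline.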

Code & Models

Models

leum-team/Leum-VL-8B-preview0320
model · 171 downloads · 2 likes

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition