Leum-VL Technical Report
Yuxuan He, Chaiming Huang, Yifan Wu, Hongjun Wang, Chenkui Shen, Jifan Zhang, Long Li

TL;DR
This paper introduces SV6D, a six-dimensional structural framework for video analysis inspired by professional storyboarding, and presents Leum-VL-8B, a model trained to understand and utilize this structure for improved video understanding.
Contribution
The paper proposes SV6D, a six-dimensional structural representation for video, and develops Leum-VL-8B, a model trained to use this structure for better comprehension.
Findings
Leum-VL-8B achieves competitive scores on multiple video understanding benchmarks.
SV6D enables more accurate identification of timeline-grounded units in videos.
The framework improves downstream tasks like editing, retrieval, and recommendation.
Abstract
A short video succeeds not simply because of what it shows, but because of how it schedules attention -- yet current multimodal models lack the structural grammar to parse or produce this organization. Existing models can describe scenes, answer event-centric questions, and read on-screen text, but they are far less reliable at identifying timeline-grounded units such as hooks, cut rationales, shot-induced tension, and platform-facing packaging cues. Inspired by professional storyboard practice in film and television production, we propose SV6D (Structured Video in Six Dimensions), a representation framework that decomposes internet-native video into six complementary structural dimensions -- subject, aesthetics, camera language, editing, narrative, and dissemination -- with each label tied to physically observable evidence on the timeline. We formalize a unified optimization…
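To make the idea of timeline-grounded labels concrete, the following is a minimal sketch of how one such annotation might be stored. It is not the schema used in the paper: the class names, field names, and helper functions are all hypothetical, and only the six dimension names and the "hook" example come from the abstract.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    """The six SV6D dimensions as named in the abstract."""
    SUBJECT = "subject"
    AESTHETICS = "aesthetics"
    CAMERA_LANGUAGE = "camera language"
    EDITING = "editing"
    NARRATIVE = "narrative"
    DISSEMINATION = "dissemination"


@dataclass(frozen=True)
class TimelineLabel:
    """One label tied to a time span (hypothetical layout, not the paper's)."""
    dimension: Dimension
    label: str       # e.g. "hook"
    start_s: float   # span start, in seconds
    end_s: float     # span end, in seconds (exclusive)
    evidence: str    # free-text note on what is observable in the span

    def __post_init__(self):
        # Reject empty, reversed, or negative spans.
        if not 0 <= self.start_s < self.end_s:
            raise ValueError("expected 0 <= start_s < end_s")


def labels_at(labels, t):
    """Return the labels whose span contains time t (in seconds)."""
    return [lab for lab in labels if lab.start_s <= t < lab.end_s]


def missing_dimensions(labels):
    """Return the dimensions that have no label yet, in declaration order."""
    covered = {lab.dimension for lab in labels}
    return [dim for dim in Dimension if dim not in covered]
```

A record like this keeps every label checkable against the video itself: a reviewer can seek to `start_s` and confirm that the stated evidence is visible, and `missing_dimensions` flags annotations that do not yet cover all six dimensions.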
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
