ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts

Yuying Ge; Yixiao Ge; Chen Li; Teng Wang; Junfu Pu; Yizhuo Li; Lu Qiu; Jin Ma; Lisheng Duan; Xinyu Zuo; Jinwen Luo; Weibo Gu; Zexuan Li; Xiaojing Zhang; Yangyu Tao; Han Hu; Di Wang; Ying Shan

arXiv:2507.20939 · cs.CV · July 29, 2025

TL;DR

ARC-Hunyuan-Video-7B is a multimodal model designed for structured comprehension of real-world short videos, enabling detailed understanding, captioning, question answering, and grounding, with strong performance and efficiency.

Contribution

The paper introduces a 7B-parameter multimodal model capable of end-to-end structured video comprehension of complex real-world shorts, together with a comprehensive training regimen and a new benchmark.

Findings

1. Strong performance on the ShortVid-Bench benchmark
2. Effective zero-shot and few-shot support for downstream applications
3. Real-world deployment improves user engagement and satisfaction

Abstract

Real-world user-generated short videos, especially those distributed on platforms such as WeChat Channel and TikTok, dominate the mobile internet. However, current large multimodal models lack essential temporally-structured, detailed, and in-depth video comprehension capabilities, which are the cornerstone of effective video search and recommendation, as well as emerging video applications. Understanding real-world shorts is actually challenging due to their complex visual elements, high information density in both visuals and audio, and fast pacing that focuses on emotional expression and viewpoint delivery. This requires advanced reasoning to effectively integrate multimodal information, including visual, audio, and text. In this work, we introduce ARC-Hunyuan-Video, a multimodal model that processes visual, audio, and textual signals from raw video inputs end-to-end for structured…

Code & Models

Models

Datasets

TencentARC/ShortVid-Bench
dataset · 1.0k downloads
