Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions

Yunheng Li; Hengrui Zhang; Meng-Hao Guo; Wenzhao Gao; Shaoyong Jia; Shaohui Jiao; Qibin Hou; Ming-Ming Cheng

arXiv:2602.13013·cs.CV·February 16, 2026

Open Access · 2 Models · 2 Datasets

TL;DR

This paper introduces ASID-1M, a structured audiovisual instruction dataset of one million annotations, and ASID-Captioner, a video understanding model trained on it, to improve fine-grained captioning and instruction following as a step toward universal video understanding.

Contribution

The paper presents three components: ASID-1M, a dataset of one million attribute-structured audiovisual annotations; ASID-Verify, a pipeline that automatically verifies and refines those annotations; and ASID-Captioner, a model trained on this data to improve audiovisual captioning.

Findings

01. Achieves state-of-the-art results among open-source models

02. Reduces hallucinations in captions

03. Improves fine-grained caption quality

Abstract

Universal video understanding requires modeling fine-grained visual and audio information over time in diverse real-world scenarios. However, the performance of existing models is primarily constrained by video-instruction data that represents complex audiovisual content as single, incomplete descriptions, lacking fine-grained organization and reliable annotation. To address this, we introduce: (i) ASID-1M, an open-source collection of one million structured, fine-grained audiovisual instruction annotations with single- and multi-attribute supervision; (ii) ASID-Verify, a scalable data curation pipeline for annotation, with automatic verification and refinement that enforces semantic and temporal consistency between descriptions and the corresponding audiovisual content; and (iii) ASID-Captioner, a video understanding model trained via Supervised Fine-Tuning (SFT) on the ASID-1M.…
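As a rough illustration of what "attribute-structured" annotations with a temporal-consistency check could look like, here is a minimal Python sketch. It is not the ASID-1M schema or the ASID-Verify pipeline: the class names, field names, attribute labels, and checks are all hypothetical, chosen only to make the idea concrete. Semantic verification against the actual audio and video would need a model and is not sketched here.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeAnnotation:
    """One attribute-level description tied to a time span (hypothetical schema)."""
    attribute: str   # illustrative labels such as "speech" or "scene"
    text: str
    start_s: float
    end_s: float


@dataclass
class VideoRecord:
    """A video with its per-attribute annotations (hypothetical schema)."""
    video_id: str
    duration_s: float
    annotations: list[AttributeAnnotation] = field(default_factory=list)


def temporal_issues(record: VideoRecord) -> list[str]:
    """Return readable problems; an empty list means the record passes these checks."""
    issues = []
    for ann in record.annotations:
        if not ann.text.strip():
            issues.append(f"{ann.attribute}: empty description")
        if ann.start_s < 0 or ann.end_s > record.duration_s:
            issues.append(f"{ann.attribute}: span outside video")
        if ann.start_s >= ann.end_s:
            issues.append(f"{ann.attribute}: non-positive span")
    return issues


def multi_attribute_caption(record: VideoRecord, attributes: set[str]) -> str:
    """Join the descriptions of the selected attributes in temporal order."""
    chosen = [a for a in record.annotations if a.attribute in attributes]
    chosen.sort(key=lambda a: a.start_s)
    return " ".join(a.text for a in chosen)


# Example record: the "scene" span runs past the end of the 10-second video.
example = VideoRecord(
    video_id="v1",
    duration_s=10.0,
    annotations=[
        AttributeAnnotation("speech", "A woman greets the camera.", 0.0, 3.0),
        AttributeAnnotation("scene", "A kitchen at night.", 0.0, 12.0),
    ],
)
```

Selecting a single attribute gives a single-attribute training target, while passing several attributes gives a combined one. A record that returns any issue from `temporal_issues` would be sent back for refinement rather than used as is.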

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis