HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding
Jiaxing Zhao, Qize Yang, Yixing Peng, Detao Bai, Shimin Yao, Boyuan Sun, Xiang Chen, Shenghao Fu, Weixuan Chen, Xihan Wei, Liefeng Bo

TL;DR
HumanOmni is a large, specialized vision-speech language model designed for comprehensive understanding of human-centric videos, integrating visual and audio data with adaptive scene-specific processing.
Contribution
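It presents HumanOmni, described by the authors as the industry's first human-centric omni-multimodal large language model, together with a large-scale human-centric dataset of over 2.4 million video clips with detailed captions and more than 14 million instructions. The model uses three specialized branches for different types of scenes.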
Findings
Outperforms existing models in emotion recognition and action understanding.
Effectively fuses visual and audio features for comprehensive scene analysis.
Demonstrates significant improvements in human-centric video comprehension tasks.
Abstract
In human-centric scenes, the ability to simultaneously understand visual and auditory information is crucial. While recent omni models can process multiple modalities, they generally lack effectiveness in human-centric scenes due to the absence of large-scale, specialized datasets and non-targeted architectures. In this work, we developed HumanOmni, the industry's first human-centric Omni-multimodal large language model. We constructed a dataset containing over 2.4 million human-centric video clips with detailed captions and more than 14 million instructions, facilitating the understanding of diverse human-centric scenes. HumanOmni includes three specialized branches for understanding different types of scenes. It adaptively fuses features from these branches based on user instructions, significantly enhancing visual understanding in scenes centered around individuals. Moreover,…
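The abstract states that HumanOmni adaptively fuses features from its three branches based on the user instruction, but this excerpt does not say how. The sketch below shows one common way such fusion can work, purely as an illustration: an instruction embedding is projected to one score per branch, the scores are normalized with a softmax, and the branch features are combined as a weighted sum. Everything here is an assumption rather than the authors' method: the function names (`branch_weights`, `fuse`), the gate matrix, the toy dimensions, and the softmax gating itself are all hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax over a short list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def branch_weights(instruction_emb: Sequence[float],
                   gate_matrix: Sequence[Sequence[float]]) -> Vector:
    """Hypothetical gate: one linear score per branch, then softmax.

    gate_matrix has one row per branch; each row has the same length
    as the instruction embedding.
    """
    logits = [sum(w * x for w, x in zip(row, instruction_emb))
              for row in gate_matrix]
    return softmax(logits)


def fuse(branch_feats: Sequence[Sequence[float]],
         weights: Sequence[float]) -> Vector:
    """Weighted sum of per-branch feature vectors (all of equal length)."""
    dim = len(branch_feats[0])
    return [sum(w * feat[i] for w, feat in zip(weights, branch_feats))
            for i in range(dim)]


# Toy inputs (assumed values, chosen only to make the behaviour visible).
instruction_emb = [1.0, 0.0]                 # stand-in for an encoded instruction
gate_matrix = [[2.0, 0.0],                   # branch 0 responds to this instruction
               [0.0, 2.0],                   # branch 1 does not
               [0.0, 0.0]]                   # branch 2 is neutral
branch_feats = [[1.0, 0.0, 0.0],             # one feature vector per branch
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]]

weights = branch_weights(instruction_emb, gate_matrix)
fused = fuse(branch_feats, weights)
```

With these toy inputs the first branch receives the largest weight, so the fused vector is dominated by that branch's features. A different instruction embedding would shift the weights toward another branch, which is the behaviour the abstract attributes to instruction-based fusion.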
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
