HumanOmni-Speaker: Identifying Who said What and When
Detao Bai, Shimin Yao, Weixuan Chen, Zhiheng Ma, Xihan Wei, Jingren Zhou

TL;DR
This paper introduces HumanOmni-Speaker, a model and benchmark for identifying who said what and when in multi-person conversations, designed to resist visual biases and capture fine-grained visual dynamics.
Contribution
It presents a new benchmark and a model built on a visual delta encoder, targeting end-to-end speaker diarization and recognition using only natural language queries.
Findings
HumanOmni-Speaker outperforms existing models on the new benchmark.
The model captures high-frequency visual dynamics like lip movements effectively.
It enables end-to-end lip-reading and precise speaker localization without intrusive cropping.
Abstract
While Omni-modal Large Language Models have made strides in joint sensory processing, they fundamentally struggle with a cornerstone of human interaction: deciphering complex, multi-person conversational dynamics to accurately answer "Who said what and when." Current models suffer from an "illusion of competence": they exploit visual biases in conventional benchmarks to bypass genuine cross-modal alignment, while relying on sparse, low-frame-rate visual sampling that destroys crucial high-frequency dynamics like lip movements. To shatter this illusion, we introduce Visual-Registered Speaker Diarization and Recognition (VR-SDR) and the HumanOmni-Speaker Benchmark. By strictly eliminating visual shortcuts, this rigorous paradigm demands true end-to-end spatio-temporal identity binding using only natural language queries. To overcome the underlying architectural perception gap, we…
