Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM

Zinuo Li; Xian Zhang; Yongxin Guo; Mohammed Bennamoun; Farid Boussaid; Girish Dwivedi; Luqi Gong; Qiuhong Ke

arXiv:2505.18110·cs.CL·February 3, 2026


TL;DR

This paper introduces TriSense, a multimodal large language model that integrates visual, audio, and speech cues for video understanding, together with TriSense-2M, a large dataset for training and evaluating such models.

Contribution

The paper presents TriSense, a novel triple-modality LLM with adaptive modality reweighting, and introduces TriSense-2M, a large dataset for training and evaluating multimodal video understanding models.

Findings

01. TriSense outperforms existing models on multiple benchmarks.

02. The adaptive reweighting mechanism improves robustness under modality dropout.

03. TriSense-2M enables broad generalization in multimodal video analysis.

Abstract

Humans naturally understand moments in a video by integrating visual and auditory cues. For example, localizing a scene in the video like "A scientist passionately speaks on wildlife conservation as dramatic orchestral music plays, with the audience nodding and applauding" requires simultaneous processing of visual, audio, and speech signals. However, existing models often struggle to effectively fuse and interpret audio information, limiting their capacity for comprehensive video temporal understanding. To address this, we present TriSense, a triple-modality large language model designed for holistic video temporal understanding through the integration of visual, audio, and speech modalities. Central to TriSense is a Query-Based Connector that adaptively reweights modality contributions based on the input query, enabling robust performance under modality dropout and allowing flexible…
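The idea of query-dependent modality reweighting can be illustrated with a minimal sketch. This is not the paper's Query-Based Connector, whose architecture is not described in the text above; it is a generic scaled dot-product softmax gate, and the function name `reweight_modalities`, the vector shapes, and the `present` argument are all hypothetical choices made for illustration. Each modality is assumed to be summarized as one feature vector with the same dimension as the query embedding.

```python
import math


def reweight_modalities(query, features, present=None):
    """Fuse per-modality feature vectors with query-dependent weights.

    Illustrative assumption only, not the TriSense implementation.
    query:    list of floats, length d (hypothetical query embedding)
    features: dict mapping modality name -> list of floats, length d
    present:  optional set of modality names that are available;
              modalities outside this set are dropped before weighting
    Returns (weights, fused), where weights sum to 1 over present modalities.
    """
    names = [n for n in features if present is None or n in present]
    if not names:
        raise ValueError("at least one modality must be present")

    d = len(query)
    # Scaled dot-product relevance of each modality to the query.
    scores = [
        sum(q * f for q, f in zip(query, features[n])) / math.sqrt(d)
        for n in names
    ]

    # Numerically stable softmax over the modalities that are present.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = {n: e / total for n, e in zip(names, exps)}

    # Weighted sum of the modality features.
    fused = [sum(weights[n] * features[n][i] for n in names) for i in range(d)]
    return weights, fused


# Toy inputs: the query is aligned with the "visual" vector.
query = [1.0, 0.0]
features = {
    "visual": [1.0, 0.0],
    "audio": [0.0, 1.0],
    "speech": [0.5, 0.5],
}

weights, fused = reweight_modalities(query, features)

# Simulated modality dropout: audio is missing, so the remaining
# modalities are renormalized and still sum to 1.
weights_no_audio, fused_no_audio = reweight_modalities(
    query, features, present={"visual", "speech"}
)
```

Because the softmax is taken only over the modalities that are present, a missing stream simply shifts its weight to the others instead of feeding zeros into the fusion. That is one simple way to read "robust under modality dropout", under the assumptions stated above.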


Code & Models

Datasets

zinuoli/TriSense-2M
dataset· 22 dl

Videos

Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM · SlidesLive