JMSC: Joint Spatial–Temporal Modeling with Semantic Completion for Audio–Visual Learning

Xinfu Xu; Fan Yang; Zhibin Yu

PMC · DOI: 10.3390/s26041288 · February 16, 2026

Open Access

TL;DR

This paper introduces JMSC, a new framework for audio-visual learning that improves understanding of dynamic scenes by combining spatial and temporal information with semantic completion.

Contribution

The novel JMSC framework uses cross-modal latent reconstruction and joint modeling of spatial and temporal features under audio guidance.
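As a rough illustration of what a cross-modal latent reconstruction objective can look like in general, the sketch below projects each modality's latent into the other's space and penalizes the mean squared error. This is an assumption made for illustration, not the paper's formulation: the linear projections, the MSE loss, and the names `reconstruction_loss`, `cross_modal_loss`, `w_a2v`, and `w_v2a` are all hypothetical.

```python
import numpy as np


def reconstruction_loss(source, target, weight):
    """MSE between a linear projection of `source` and `target`.

    Hypothetical helper: a linear map stands in for whatever
    reconstruction module a real model would use.
    """
    predicted = source @ weight
    return float(np.mean((predicted - target) ** 2))


def cross_modal_loss(audio, visual, w_a2v, w_v2a):
    """Sum of audio-to-visual and visual-to-audio reconstruction errors."""
    return (reconstruction_loss(audio, visual, w_a2v)
            + reconstruction_loss(visual, audio, w_v2a))


# Toy latents: 2 tokens, 3 feature dimensions per modality.
audio_latent = np.ones((2, 3))
visual_latent = np.ones((2, 3))

# Identity projections reconstruct identical latents exactly.
perfect = cross_modal_loss(audio_latent, visual_latent, np.eye(3), np.eye(3))

# Zero projections predict all zeros, so each direction has error 1.0.
worst = cross_modal_loss(audio_latent, visual_latent,
                         np.zeros((3, 3)), np.zeros((3, 3)))
```

In this toy setup the loss is zero when each latent is fully recoverable from the other and grows as the projections fail to capture the shared content.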

Findings

01. JMSC achieves state-of-the-art performance on multiple audio-visual tasks.

02. The method maintains high computational efficiency while improving semantic understanding.

03. Cross-modal reconstruction enhances the model's ability to capture complementary audio-visual semantics.

Abstract

Audio–visual learning seeks to achieve holistic scene understanding by integrating auditory and visual cues. Early research focused on fully fine-tuning pre-trained models, incurring high computational costs. Consequently, recent studies have adopted parameter-efficient tuning methods to adapt large-scale vision models to the audio–visual domain. Despite the competitive performance of existing methods, several challenges persist. Firstly, effectively leveraging the complementary semantics between the audio and visual modalities remains difficult, as these two modalities capture fundamentally different aspects of a video. Secondly, comprehending dynamic video context is challenging because both spatial attributes (such as scale) and temporal characteristics (such as motion) of objects co-evolve over time, making semantic comprehension more complex. To address these challenges, we propose…

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Multimodal Machine Learning Applications