# JMSC: Joint Spatial–Temporal Modeling with Semantic Completion for Audio–Visual Learning

**Authors:** Xinfu Xu, Fan Yang, Zhibin Yu

PMC · DOI: 10.3390/s26041288 · 2026-02-16

## TL;DR

This paper introduces JMSC, an audio–visual learning framework that improves understanding of dynamic scenes by jointly modeling spatial and temporal video features under audio guidance and by reconstructing one modality's semantic summary from a masked version of the other.

## Contribution

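JMSC contributes two components: cross-modal latent reconstruction, which recovers one modality's complete semantic summary from a masked version of the other, and joint modeling of video spatial attributes and temporal changes under audio guidance.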
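
As a rough illustration of what "joint modeling under audio guidance" could mean, the sketch below lets a single audio vector attend over all video patches from all frames at once, so that spatial position and time are weighted together rather than in separate passes. This is an assumption made for illustration, not the paper's architecture: the function name `audio_guided_pool`, the scaled dot-product scoring, and the toy inputs are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def audio_guided_pool(video, audio_query):
    """Pool space-time video tokens with one audio query (hypothetical helper).

    video[t][p] is the feature vector of patch p in frame t.
    Returns (pooled_vector, weights), where weights[t][p] sums to 1 over
    every (frame, patch) pair, i.e. attention is joint over space and time.
    """
    dim = len(audio_query)
    # Flatten frames and patches into one sequence of space-time tokens.
    flat = [(t, p, vec)
            for t, frame in enumerate(video)
            for p, vec in enumerate(frame)]
    # Scaled dot-product similarity between the audio query and each token.
    scores = [sum(q * v for q, v in zip(audio_query, vec)) / math.sqrt(dim)
              for _, _, vec in flat]
    attn = softmax(scores)
    # Attention-weighted sum of the token features.
    pooled = [sum(w * vec[d] for w, (_, _, vec) in zip(attn, flat))
              for d in range(dim)]
    # Reshape the weights back into a frame-by-patch grid for inspection.
    weights = [[0.0] * len(frame) for frame in video]
    for w, (t, p, _) in zip(attn, flat):
        weights[t][p] = w
    return pooled, weights


# Toy input: 2 frames, 2 patches per frame, 2-dimensional features.
video = [
    [[0.0, 1.0], [0.0, 0.0]],
    [[5.0, 0.0], [0.0, 1.0]],
]
audio_query = [1.0, 0.0]
pooled, weights = audio_guided_pool(video, audio_query)
```

In this toy input the patch most aligned with the audio query sits in the second frame, so it receives the largest weight; the weight grid can be read as a crude "where and when" localization map.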

## Key findings

- JMSC achieves state-of-the-art performance across multiple downstream audio–visual tasks.
- The method maintains high computational efficiency while improving semantic understanding.
- Cross-modal latent reconstruction moves the model beyond shallow correlation, helping it capture complementary audio–visual semantics.

## Abstract

Audio–visual learning seeks to achieve holistic scene understanding by integrating auditory and visual cues. Early research focused on fully fine-tuning pre-trained models, incurring high computational costs. Consequently, recent studies have adopted parameter-efficient tuning methods to adapt large-scale vision models to the audio–visual domain. Despite the competitive performance of existing methods, several challenges persist. Firstly, effectively leveraging the complementary semantics between the audio and visual modalities remains difficult, as these two modalities capture fundamentally different aspects of a video. Secondly, comprehending dynamic video context is challenging because both spatial attributes (such as scale) and temporal characteristics (such as motion) of objects co-evolve over time, making semantic comprehension more complex. To address these challenges, we propose a novel framework, named Joint Spatial–Temporal Modeling with Semantic Completion (JMSC). JMSC introduces cross-modal latent reconstruction, which moves beyond shallow correlation by encouraging the model to reconstruct one modality’s complete semantic summary from a masked version of its counterpart. Furthermore, JMSC learns a unified representation of video spatial attributes and temporal changes by jointly modeling them under audio guidance, enabling accurate localization and consistent tracking in dynamic video scenes. Experimental results demonstrate that JMSC achieves state-of-the-art performance across multiple downstream tasks while maintaining high computational efficiency.
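
The cross-modal latent reconstruction idea described above can be sketched in a few lines. The abstract does not specify the encoder, the masking scheme, or the loss, so everything concrete here is an assumption for illustration: mean pooling stands in for the "semantic summary", a single linear map stands in for the reconstruction head, masked tokens are zeroed, and the loss is mean squared error. All function names are hypothetical.

```python
import random


def mean_pool(tokens):
    """Average equal-length token vectors into one summary vector."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]


def mask_tokens(tokens, mask_ratio, rng):
    """Zero out a random subset of tokens; returns (masked_tokens, masked_indices)."""
    n_mask = int(len(tokens) * mask_ratio)
    masked_idx = set(rng.sample(range(len(tokens)), n_mask))
    masked = [[0.0] * len(t) if i in masked_idx else list(t)
              for i, t in enumerate(tokens)]
    return masked, masked_idx


def linear(vec, weight):
    """Apply a weight matrix (out_dim rows, in_dim columns) to a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def reconstruction_loss(source_tokens, target_tokens, weight, mask_ratio, rng):
    """MSE between the target modality's full summary and a prediction
    made from a masked view of the source modality."""
    masked, _ = mask_tokens(source_tokens, mask_ratio, rng)
    predicted = linear(mean_pool(masked), weight)
    target = mean_pool(target_tokens)  # summary of the unmasked counterpart
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# Toy data: 8 visual tokens and 6 audio tokens, each 4-dimensional.
rng = random.Random(0)
visual = [[rng.random() for _ in range(4)] for _ in range(8)]
audio = [[rng.random() for _ in range(4)] for _ in range(6)]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

# Predict the audio summary from half-masked visual tokens, and vice versa.
loss_v2a = reconstruction_loss(visual, audio, identity, 0.5, rng)
loss_a2v = reconstruction_loss(audio, visual, identity, 0.5, rng)
```

In a trained model the linear map would be learned so that both losses fall; the point of the sketch is only that the target is the counterpart's complete summary while the input is deliberately incomplete.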

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12943970/full.md

---
Source: https://tomesphere.com/paper/PMC12943970