Pilot-guided Multimodal Semantic Communication for Audio-Visual Event Localization

Fei Yu; Zhe Xiang; Nan Che; Zhuoran Zhang; Yuandi Li; Junxiao Xue; Zhiguo Wan

arXiv:2412.06208·cs.SD·December 10, 2024

Open Access

TL;DR

This paper introduces a pilot-guided multimodal semantic communication framework for audio-visual event localization, improving robustness and performance over existing methods in dynamic real-world scenarios.

Contribution

It proposes a novel pilot-guided framework with Euler-based multimodal encoding and decoding, addressing the limitations of current single-modality and analog channel approaches.

Findings

01

Outperforms benchmarks in Signal-to-Noise Ratio (SNR)

02

Demonstrates robustness to channel variations

03

Supports diverse communication scenarios

Abstract

Multimodal semantic communication, which integrates various data modalities such as text, images, and audio, significantly enhances communication efficiency and reliability. It has broad application prospects in fields such as artificial intelligence, autonomous driving, and smart homes. However, current research primarily relies on analog channels and assumes constant channel states (perfect CSI), which is inadequate for addressing dynamic physical channels and noise in real-world scenarios. Existing methods often focus on single-modality tasks and fail to handle multimodal stream data, such as video and audio, and their corresponding tasks. Furthermore, current semantic encoding and decoding modules mainly transmit single-modality features, neglecting the need for multimodal semantic enhancement and recognition tasks. To address these challenges, this paper proposes a pilot-guided…

Taxonomy

Topics: Music and Audio Processing

Methods: Focus