Pilot-guided Multimodal Semantic Communication for Audio-Visual Event Localization
Fei Yu, Zhe Xiang, Nan Che, Zhuoran Zhang, Yuandi Li, Junxiao Xue, Zhiguo Wan

TL;DR
This paper introduces a pilot-guided multimodal semantic communication framework for audio-visual event localization, improving robustness and performance over existing methods in dynamic real-world scenarios.
Contribution
It proposes a novel pilot-guided framework with Euler-based multimodal encoding and decoding, addressing the limitations of current single-modality and analog channel approaches.
Findings
Outperforms benchmarks in Signal-to-Noise Ratio (SNR)
Demonstrates robustness to channel variations
Supports diverse communication scenarios
Abstract
Multimodal semantic communication, which integrates various data modalities such as text, images, and audio, significantly enhances communication efficiency and reliability. It has broad application prospects in fields such as artificial intelligence, autonomous driving, and smart homes. However, current research primarily relies on analog channels and assumes constant channel states (perfect CSI), which is inadequate for addressing dynamic physical channels and noise in real-world scenarios. Existing methods often focus on single-modality tasks and fail to handle multimodal stream data, such as video and audio, and their corresponding tasks. Furthermore, current semantic encoding and decoding modules mainly transmit single-modality features, neglecting the need for multimodal semantic enhancement and recognition tasks. To address these challenges, this paper proposes a pilot-guided…
Taxonomy
Topics: Music and Audio Processing
Methods: Focus
