WiFi2Cap: Semantic Action Captioning from Wi-Fi CSI via Limb-Level Semantic Alignment
Tzu-Ti Wei, Chu-Yu Huang, Yu-Chee Tseng, Jen-Jee Chen

TL;DR
WiFi2Cap is a framework that generates natural-language descriptions of human actions from Wi-Fi CSI signals. It uses cross-modal alignment to address the semantic gap between wireless signals and language as well as left/right limb ambiguity, and it comes with a new dataset.
Contribution
The paper introduces WiFi2Cap, a three-stage Wi-Fi CSI-based action captioning framework with a novel Mirror-Consistency Loss and a new benchmark dataset for semantic Wi-Fi sensing.
Findings
Outperforms baseline methods on multiple captioning metrics
Effectively reduces left/right limb ambiguity in captions
Demonstrates privacy-preserving semantic activity understanding
Abstract
Privacy-preserving semantic understanding of human activities is important for indoor sensing, yet existing Wi-Fi CSI-based systems mainly focus on pose estimation or predefined action classification rather than fine-grained language generation. Mapping CSI to natural-language descriptions remains challenging because of the semantic gap between wireless signals and language and direction-sensitive ambiguities such as left/right limb confusion. We propose WiFi2Cap, a three-stage framework for generating action captions directly from Wi-Fi CSI. A vision-language teacher learns transferable supervision from synchronized video-text pairs, and a CSI student is aligned to the teacher's visual space and text embeddings. To improve direction-sensitive captioning, we introduce a Mirror-Consistency Loss that reduces mirrored-action and left-right ambiguities during cross-modal alignment. A…
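The abstract does not give the form of the Mirror-Consistency Loss, so the following is only an illustrative sketch of one way a mirror-aware alignment term could look, not the authors' formulation. It assumes a margin-based objective in which a CSI embedding should be more similar to its true caption embedding than to the embedding of the left/right-swapped caption. The names `mirror_caption`, `mirror_margin_loss`, and the `margin` value are hypothetical.

```python
import re
import numpy as np

def mirror_caption(caption: str) -> str:
    """Hypothetical helper: swap the words 'left' and 'right' to get a mirrored caption."""
    swap = {"left": "right", "right": "left"}
    return re.sub(r"\b(left|right)\b", lambda m: swap[m.group(1)], caption)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mirror_margin_loss(csi_emb, text_emb, mirrored_text_emb, margin=0.2) -> float:
    """Assumed hinge-style term: penalize a CSI embedding that is not at least
    `margin` closer to the true caption than to the mirrored caption."""
    pos = cosine(csi_emb, text_emb)           # similarity to the correct caption
    neg = cosine(csi_emb, mirrored_text_emb)  # similarity to the left/right-swapped caption
    return max(0.0, margin - pos + neg)

# Toy embeddings: the first matches the true caption, the second matches the mirrored one.
true_text = np.array([1.0, 0.0])
mirrored_text = np.array([0.0, 1.0])
well_aligned = mirror_margin_loss(np.array([1.0, 0.0]), true_text, mirrored_text)
confused = mirror_margin_loss(np.array([0.0, 1.0]), true_text, mirrored_text)
```

In this toy setup the well-aligned CSI embedding incurs no penalty, while the one that matches the mirrored caption is penalized, which is the behavior a direction-sensitive alignment term would need.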
Taxonomy
Topics: Indoor and Outdoor Localization Technologies · Multimodal Machine Learning Applications · Human Pose and Action Recognition
