WiFi2Cap: Semantic Action Captioning from Wi-Fi CSI via Limb-Level Semantic Alignment

Tzu-Ti Wei; Chu-Yu Huang; Yu-Chee Tseng; Jen-Jee Chen

arXiv:2603.22690 · cs.CV · March 25, 2026

Open Access

TL;DR

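WiFi2Cap generates natural-language descriptions of human actions from Wi-Fi CSI signals. It bridges the semantic gap between wireless signals and language through cross-modal alignment with a vision-language teacher, reduces left/right limb ambiguity with a Mirror-Consistency Loss, and is accompanied by a new benchmark dataset.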

Contribution

The paper introduces WiFi2Cap, a three-stage Wi-Fi CSI-based action captioning framework with a novel Mirror-Consistency Loss and a new benchmark dataset for semantic Wi-Fi sensing.

Findings

01 Outperforms baseline methods on multiple captioning metrics

02 Effectively reduces left/right limb ambiguity in captions

03 Demonstrates privacy-preserving semantic activity understanding

Abstract

Privacy-preserving semantic understanding of human activities is important for indoor sensing, yet existing Wi-Fi CSI-based systems mainly focus on pose estimation or predefined action classification rather than fine-grained language generation. Mapping CSI to natural-language descriptions remains challenging for two reasons: the semantic gap between wireless signals and language, and direction-sensitive ambiguities such as left/right limb confusion. We propose WiFi2Cap, a three-stage framework for generating action captions directly from Wi-Fi CSI. A vision-language teacher learns transferable supervision from synchronized video-text pairs, and a CSI student is aligned to the teacher's visual space and text embeddings. To improve direction-sensitive captioning, we introduce a Mirror-Consistency Loss that reduces mirrored-action and left-right ambiguities during cross-modal alignment. A…
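
The abstract does not give the form of the Mirror-Consistency Loss, so the following is only an illustrative sketch of the general idea of penalizing left/right confusion during alignment, not the authors' formulation. It assumes a generic margin-based (hinge) penalty on cosine similarity: the CSI embedding should be closer to the embedding of the true caption than to the embedding of the same caption with "left" and "right" swapped. The names `mirror_caption`, `cosine`, `mirror_margin_loss`, and the `margin` value are hypothetical and do not come from the paper.

```python
import math
import re

# Hypothetical word swap used to build a "mirrored" caption.
_SWAP = {"left": "right", "right": "left"}


def mirror_caption(caption: str) -> str:
    """Swap the words 'left' and 'right'.

    Matching is case-insensitive; the swapped word is emitted in lowercase.
    """
    return re.sub(
        r"\b(left|right)\b",
        lambda m: _SWAP[m.group(1).lower()],
        caption,
        flags=re.IGNORECASE,
    )


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mirror_margin_loss(csi_emb, text_emb, mirrored_text_emb, margin=0.2):
    """Assumed hinge penalty, not the paper's loss.

    The value is zero once the CSI embedding is more similar to the true
    caption embedding than to the mirrored caption embedding by at least
    `margin`; otherwise it grows linearly with the violation.
    """
    gap = cosine(csi_emb, text_emb) - cosine(csi_emb, mirrored_text_emb)
    return max(0.0, margin - gap)
```

In a training loop, a term of this kind would typically be added to the main alignment objective, with the mirrored caption embedded by the same text encoder as the original caption. How WiFi2Cap actually combines its losses is not stated in the abstract.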

Taxonomy

Topics: Indoor and Outdoor Localization Technologies · Multimodal Machine Learning Applications · Human Pose and Action Recognition