Task-Oriented Communication for Human Action Understanding via Edge-Cloud Co-Inference
Jingyi Liu, Cheng Yuan, Lijun He, Jun Zhang, Jiawei Shao

TL;DR
This paper introduces a task-oriented edge-cloud framework for human action understanding that transmits compact motion tokens instead of raw video, reducing both transmission payload and latency while keeping accuracy comparable.
Contribution
The proposed framework combines pose estimation, vector quantized autoencoders, and vision-language models to enable efficient, privacy-preserving action understanding with minimal data transfer.
Findings
Reduces transmission payload to about 1% of that required by traditional methods.
Lowers system latency to around 20% of that of video codec solutions.
Achieves comparable action understanding accuracy with less data and faster processing.
Abstract
The expanding application of smart sensing has created a growing demand for the accurate understanding of human action at the network edge. Traditional approaches require massive video data to be transmitted from resource-constrained edge devices to powerful cloud servers, incurring prohibitive uplink bandwidth consumption and unacceptable latency while raising privacy concerns. To overcome these bottlenecks, we propose a task-oriented communication framework for human action understanding (TOAU) through edge-cloud collaboration. Our framework utilizes a monocular pose estimator to extract continuous joint coordinates from raw videos, followed by a vector quantized variational autoencoder (VQ-VAE) to convert these coordinates into discrete motion tokens. Consequently, only a compact sequence of codebook indices is transmitted over the network, consuming as few as 9 bits per frame and…
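The quantize-then-transmit step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the 512-entry codebook (chosen only because one index then fits in 9 bits), the latent dimension, the one-token-per-frame layout, and the function names `quantize`, `pack_indices`, and `unpack_indices` are all assumptions made for illustration. The codebook and latents are random stand-ins for a trained VQ-VAE.

```python
import numpy as np


def quantize(latents, codebook):
    """Map each latent vector (T, D) to the index of its nearest codebook entry (K, D)."""
    # Squared Euclidean distance between every latent and every codebook vector.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def pack_indices(indices, bits):
    """Pack integer indices into bytes using a fixed number of bits per index."""
    bitstring = "".join(format(int(i), f"0{bits}b") for i in indices)
    # Pad with zeros up to a whole number of bytes.
    padded_len = -(-len(bitstring) // 8) * 8
    bitstring = bitstring.ljust(padded_len, "0")
    return bytes(int(bitstring[i:i + 8], 2) for i in range(0, padded_len, 8))


def unpack_indices(payload, bits, count):
    """Recover `count` indices from a payload produced by pack_indices."""
    bitstring = "".join(format(b, "08b") for b in payload)
    return np.array([int(bitstring[i * bits:(i + 1) * bits], 2) for i in range(count)])


# Hypothetical sizes: 512 codes -> 9 bits per index; 16-D latents; 30 frames.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 16))
latents = rng.normal(size=(30, 16))       # stand-in for encoder output on the edge device
bits = (codebook.shape[0] - 1).bit_length()

indices = quantize(latents, codebook)      # edge side: discrete motion tokens
payload = pack_indices(indices, bits)      # what would go over the uplink
received = unpack_indices(payload, bits, len(indices))
reconstructed = codebook[received]         # cloud side: codebook lookup
```

Under these assumed sizes, 30 indices at 9 bits each occupy 270 bits, which pads to 34 bytes. The cloud side needs only the shared codebook to turn the received indices back into latent vectors.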
