Sema: Semantic Transport for Real-Time Multimodal Agents
Jiaying Meng, Bojie Li

TL;DR
Sema is a semantic transport system that cuts uplink bandwidth for real-time multimodal agents by transmitting only task-relevant semantic tokens, while keeping task accuracy close to the raw-data baseline.
Contribution
It introduces a semantic transport approach that combines audio tokenization with compact visual representations, cutting bandwidth while largely preserving task performance.
Findings
Sema reduces uplink bandwidth by 64x for audio and 130-210x for screenshots.
It maintains task accuracy within 0.7 percentage points of the raw-data baseline.
Simulation results demonstrate significant efficiency improvements under WAN conditions.
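To make the reduction factors above concrete, here is a minimal arithmetic sketch. The 130x and 210x factors come from the findings; the raw screenshot size, the uplink rate, and the helper names `reduced_size` and `upload_seconds` are hypothetical values chosen for illustration, not figures or code from the paper. Upload time here is serialization time only and ignores round-trip latency and protocol overhead.

```python
def reduced_size(raw_bytes: float, factor: float) -> float:
    """Payload size after an N-fold bandwidth reduction."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return raw_bytes / factor


def upload_seconds(payload_bytes: float, uplink_bps: float) -> float:
    """Serialization time on a link of uplink_bps bits per second."""
    if uplink_bps <= 0:
        raise ValueError("uplink_bps must be positive")
    return payload_bytes * 8 / uplink_bps


# Hypothetical inputs (assumptions, not taken from the paper).
RAW_SCREENSHOT_BYTES = 500_000   # assumed raw screenshot size
UPLINK_BPS = 1_000_000           # assumed 1 Mbit/s constrained uplink

if __name__ == "__main__":
    raw_time = upload_seconds(RAW_SCREENSHOT_BYTES, UPLINK_BPS)
    print(f"raw: {RAW_SCREENSHOT_BYTES} B, {raw_time:.3f} s")
    for factor in (130, 210):  # screenshot reduction range from the findings
        small = reduced_size(RAW_SCREENSHOT_BYTES, factor)
        t = upload_seconds(small, UPLINK_BPS)
        print(f"{factor}x: {small:.0f} B, {t:.4f} s")
```

Under these assumed inputs the upload time scales down by the same factor as the payload, which is why the reduction matters most on constrained uplinks.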
Abstract
Real-time multimodal agents transport raw audio and screenshots using networking stacks designed for human receivers, which optimize for perceptual fidelity and smooth playout. Yet agent models act as event-driven processors with no inherent sense of physical time, consuming task-relevant semantics rather than reconstructing signals in real time. This fundamental difference shifts the transport goal from the technical problem of signal fidelity (Shannon-Weaver Level A) to the semantic problem of meaning preservation (Level B). This mismatch imposes significant overhead. In visual pipelines, screenshot upload accounts for over 60% of end-to-end action latency on constrained uplinks, and in voice pipelines, conventional transport carries massive redundancy, sending 43-64x more data than needed to maintain task accuracy. We present Sema, a semantic transport system that combines discrete…
