Sema: Semantic Transport for Real-Time Multimodal Agents

Jiaying Meng; Bojie Li

arXiv:2604.20940·cs.MM·April 24, 2026

TL;DR

Sema is a semantic transport system for real-time multimodal agents. It transmits only task-relevant semantic tokens, cutting uplink bandwidth by 64x for audio and 130-210x for screenshots while keeping task accuracy within 0.7 percentage points of the raw-data baseline.

Contribution

The paper introduces a semantic transport approach that combines audio tokenization with compact visual representations, sharply reducing uplink bandwidth while keeping task performance close to that of raw-data transport.

Findings

1. Sema reduces uplink bandwidth by 64x for audio and by 130-210x for screenshots.

2. It keeps task accuracy within 0.7 percentage points of the raw-data baseline.

3. Simulation results demonstrate significant efficiency improvements under WAN conditions.
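
To give a feel for what these reduction factors mean in practice, here is a minimal arithmetic sketch. Only the factors (64x, 130x, 210x) come from the findings. The raw audio format, the screenshot size, and the uplink rate are hypothetical values chosen for illustration and are not taken from the paper.

```python
# Illustrative arithmetic only. The reduction factors come from the findings;
# every other constant below is a hypothetical assumption.

def reduced_bitrate_bps(raw_bps: float, factor: float) -> float:
    """Bitrate left after applying a bandwidth reduction factor."""
    if factor <= 0:
        raise ValueError("reduction factor must be positive")
    return raw_bps / factor


def upload_seconds(payload_bytes: float, uplink_bps: float) -> float:
    """Serialization time for a payload on an uplink (ignores RTT and loss)."""
    if uplink_bps <= 0:
        raise ValueError("uplink rate must be positive")
    return payload_bytes * 8 / uplink_bps


# Assumption: raw audio is 16 kHz, 16-bit mono PCM (256,000 bit/s).
RAW_AUDIO_BPS = 16_000 * 16
# Assumption: one raw screenshot is 500 kB and the uplink is 2 Mbit/s.
SCREENSHOT_BYTES = 500_000
UPLINK_BPS = 2_000_000

audio_bps = reduced_bitrate_bps(RAW_AUDIO_BPS, 64)
raw_upload = upload_seconds(SCREENSHOT_BYTES, UPLINK_BPS)
best_upload = upload_seconds(SCREENSHOT_BYTES / 210, UPLINK_BPS)
worst_upload = upload_seconds(SCREENSHOT_BYTES / 130, UPLINK_BPS)

print(f"audio after 64x reduction: {audio_bps:.0f} bit/s")
print(f"raw screenshot upload: {raw_upload:.3f} s")
print(f"screenshot upload after 130-210x: {best_upload:.4f} to {worst_upload:.4f} s")
```

Under these assumed inputs, the audio stream drops from 256 kbit/s to 4 kbit/s, and the screenshot upload falls from about two seconds to roughly ten to fifteen milliseconds. The real gains depend on the paper's actual baselines, which this summary does not state.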

Abstract

Real-time multimodal agents transport raw audio and screenshots using networking stacks designed for human receivers, which optimize for perceptual fidelity and smooth playout. Yet agent models act as event-driven processors with no inherent sense of physical time, consuming task-relevant semantics rather than reconstructing signals in real time. This fundamental difference shifts the transport goal from the technical problem of signal fidelity (Shannon-Weaver Level A) to the semantic problem of meaning preservation (Level B). This mismatch imposes significant overhead. In visual pipelines, screenshot upload accounts for over 60% of end-to-end action latency on constrained uplinks, and in voice pipelines, conventional transport carries massive redundancy, sending 43-64x more data than needed to maintain task accuracy. We present Sema, a semantic transport system that combines discrete…
