Modality-Native Routing in Agent-to-Agent Networks: A Multimodal A2A Protocol Extension
Vasundra Srinivasan

TL;DR
This paper introduces MMA2A, a routing architecture that preserves native multimodal signals in agent networks, raising task accuracy from 32% to 52% on the 50-task CrossModal-CS benchmark at the cost of a 1.8x latency increase.
Contribution
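The paper presents MMA2A, a routing layer atop A2A that inspects Agent Card capability declarations to route voice, image, and text parts in their native modality, and shows that the accuracy gain materializes only when the downstream agent can reason over the preserved signals.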
Findings
MMA2A achieves 52% task accuracy versus 32% for the text-bottleneck baseline.
Native routing improves performance on vision-dependent tasks.
Native routing increases latency by a factor of 1.8 in exchange for the higher accuracy.
Abstract
Preserving multimodal signals across agent boundaries is necessary for accurate cross-modal reasoning, but it is not sufficient. We show that modality-native routing in Agent-to-Agent (A2A) networks improves task accuracy by 20 percentage points over text-bottleneck baselines, but only when the downstream reasoning agent can exploit the richer context that native routing preserves. An ablation replacing LLM-backed reasoning with keyword matching eliminates the accuracy gap entirely (36% vs. 36%), establishing a two-layer requirement: protocol-level routing must be paired with capable agent-level reasoning for the benefit to materialize. We present MMA2A, an architecture layer atop A2A that inspects Agent Card capability declarations to route voice, image, and text parts in their native modality. On CrossModal-CS, a controlled 50-task benchmark with the same LLM backend, same tasks,…
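The routing idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: `AgentCard`, its `input_modes` field, `Part`, `to_text_fallback`, and `route_part` are hypothetical, simplified stand-ins, and the assumption that capabilities are declared as MIME-type strings is made only for illustration. The sketch forwards a part unchanged when the target agent declares support for its modality, and otherwise falls back to a text rendering (the text bottleneck).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AgentCard:
    # Hypothetical, simplified stand-in for an agent's capability declaration.
    name: str
    input_modes: set[str] = field(default_factory=set)  # assumed MIME-type strings


@dataclass
class Part:
    mime_type: str
    payload: bytes | str


def to_text_fallback(part: Part) -> Part:
    # Placeholder for a lossy conversion such as transcription or captioning.
    return Part("text/plain", f"[text rendering of {part.mime_type} content]")


def accepts(card: AgentCard, mime_type: str) -> bool:
    # Match either the exact type ("image/png") or a wildcard ("image/*").
    major = mime_type.split("/")[0]
    return mime_type in card.input_modes or f"{major}/*" in card.input_modes


def route_part(part: Part, card: AgentCard) -> Part:
    # Native routing: keep the original modality when the agent declares support.
    if accepts(card, part.mime_type):
        return part
    # Otherwise degrade to text so the agent can still consume the part.
    return to_text_fallback(part)


if __name__ == "__main__":
    vision_agent = AgentCard("vision", {"image/*", "text/plain"})
    text_agent = AgentCard("text-only", {"text/plain"})
    image = Part("image/png", b"\x89PNG")
    print(route_part(image, vision_agent).mime_type)
    print(route_part(image, text_agent).mime_type)
```

Note that the abstract's ablation result implies a router like this is only half of the requirement: an agent that receives the native image part must also be able to reason over it for the accuracy gain to appear.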
