Modality-Native Routing in Agent-to-Agent Networks: A Multimodal A2A Protocol Extension

Vasundra Srinivasan

arXiv:2604.12213·cs.AI·April 16, 2026

TL;DR

This paper introduces MMA2A, a routing layer atop the A2A protocol that keeps voice, image, and text parts in their native modality as they pass between agents. On the CrossModal-CS benchmark it raises task accuracy from 32% to 52%, at the cost of a 1.8x increase in latency.

Contribution

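The paper presents MMA2A, an architecture layer atop A2A that inspects Agent Card capability declarations and routes voice, image, and text parts in their native modality. It also shows that native routing improves accuracy only when the downstream reasoning agent can exploit the context that routing preserves.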

Findings

01

MMA2A achieves 52% task accuracy versus 32% for the text-bottleneck baseline, a gain of 20 percentage points.

02

Native routing improves vision-dependent task performance.

03

Native routing increases latency 1.8-fold in exchange for higher accuracy.

Abstract

Preserving multimodal signals across agent boundaries is necessary for accurate cross-modal reasoning, but it is not sufficient. We show that modality-native routing in Agent-to-Agent (A2A) networks improves task accuracy by 20 percentage points over text-bottleneck baselines, but only when the downstream reasoning agent can exploit the richer context that native routing preserves. An ablation replacing LLM-backed reasoning with keyword matching eliminates the accuracy gap entirely (36% vs. 36%), establishing a two-layer requirement: protocol-level routing must be paired with capable agent-level reasoning for the benefit to materialize. We present MMA2A, an architecture layer atop A2A that inspects Agent Card capability declarations to route voice, image, and text parts in their native modality. On CrossModal-CS, a controlled 50-task benchmark with the same LLM backend, same tasks,…
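
The routing idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: `Part`, `AgentCard`, `input_modes`, `to_text`, and `route` are hypothetical names chosen for illustration, and the Agent Card is reduced to a set of declared input modalities. A part is forwarded unchanged when the receiving agent declares support for its modality; otherwise it is collapsed to text, which corresponds to the text-bottleneck baseline.

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    """One message part; `modality` is 'text', 'image', or 'voice'."""
    modality: str
    payload: str  # stand-in for raw bytes or a URI


@dataclass
class AgentCard:
    """Hypothetical, simplified stand-in for an A2A Agent Card."""
    name: str
    input_modes: set[str] = field(default_factory=lambda: {"text"})


def to_text(part: Part) -> Part:
    """Placeholder for transcription or captioning (the text bottleneck)."""
    return Part("text", f"[{part.modality} described as text] {part.payload}")


def route(parts: list[Part], card: AgentCard) -> list[Part]:
    """Forward each part natively if the card declares support, else as text."""
    return [p if p.modality in card.input_modes else to_text(p) for p in parts]


message = [Part("text", "Why is my router blinking?"), Part("image", "photo.jpg")]
vision_agent = AgentCard("vision", {"text", "image"})
text_agent = AgentCard("text-only")  # declares text input only

native = route(message, vision_agent)
fallback = route(message, text_agent)
print([p.modality for p in native])    # ['text', 'image']
print([p.modality for p in fallback])  # ['text', 'text']
```

In this sketch the image reaches the vision-capable agent intact, while the text-only agent receives a textual stand-in. As the abstract stresses, preserving the image is only useful if the receiving agent's reasoning can actually exploit it.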
