One Supervisor, Many Modalities: Adaptive Tool Orchestration for Autonomous Queries
Mayank Saini, Arit Kumar Bishwas

TL;DR
This paper introduces an adaptive AI framework that dynamically orchestrates tools across multiple modalities for autonomous query processing, cutting time-to-accurate-answer by 72% and cost by 67% against a matched hierarchical baseline while maintaining accuracy parity.
Contribution
It presents a novel agentic AI system with a central Supervisor that coordinates multimodal tools using adaptive routing strategies, enhancing autonomous query handling.
Findings
72% reduction in time-to-accurate-answer
85% reduction in conversational rework
67% cost reduction
Abstract
We present an agentic AI framework for autonomous multimodal query processing that coordinates specialized tools across text, image, audio, video, and document modalities. A central Supervisor dynamically decomposes user queries, delegates subtasks to modality-appropriate tools (e.g., object detection, OCR, speech transcription), and synthesizes results through adaptive routing strategies rather than predetermined decision trees. For text-only queries, the framework uses learned routing via RouteLLM, while non-text paths use SLM-assisted modality decomposition. Evaluated on 2,847 queries across 15 task categories, our framework achieves a 72% reduction in time-to-accurate-answer, an 85% reduction in conversational rework, and a 67% cost reduction compared to the matched hierarchical baseline while maintaining accuracy parity. These results demonstrate that intelligent centralized orchestration…
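The Supervisor pattern described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tool functions, the `Query` type, the `difficulty_score` heuristic, and the model names are all hypothetical stand-ins. In particular, the word-count threshold only imitates the role of a learned router and does not use RouteLLM's API, and the per-attachment loop stands in for SLM-assisted modality decomposition.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Hypothetical tool stubs: stand-ins for real OCR, detection, and transcription tools.
def ocr_tool(payload: str) -> str:
    return f"ocr({payload})"


def object_detection_tool(payload: str) -> str:
    return f"objects({payload})"


def speech_transcription_tool(payload: str) -> str:
    return f"transcript({payload})"


# Assumed mapping from modality name to the tool that handles it.
TOOLS: Dict[str, Callable[[str], str]] = {
    "document": ocr_tool,
    "image": object_detection_tool,
    "audio": speech_transcription_tool,
}


@dataclass
class Query:
    text: str
    # Maps a modality name to its payload (for example a file name).
    attachments: Dict[str, str] = field(default_factory=dict)


def difficulty_score(text: str) -> float:
    """Toy proxy for a learned router's score: longer text counts as harder."""
    return min(len(text.split()) / 20.0, 1.0)


def route_text(text: str, threshold: float = 0.5) -> str:
    """Pick a (hypothetical) model tier for a text-only query."""
    return "strong-model" if difficulty_score(text) >= threshold else "weak-model"


def supervise(query: Query) -> List[str]:
    """Route text-only queries to a model tier; otherwise delegate per modality."""
    if not query.attachments:
        return [f"{route_text(query.text)}: {query.text}"]
    results: List[str] = []
    for modality, payload in query.attachments.items():
        tool = TOOLS.get(modality)
        if tool is None:
            results.append(f"unsupported modality: {modality}")
        else:
            results.append(tool(payload))
    return results


if __name__ == "__main__":
    print(supervise(Query("What is 2+2?")))
    print(supervise(Query("Summarize these", {"document": "scan.pdf", "audio": "call.wav"})))
```

A real system would replace the stubs with actual tools and add a synthesis step that merges the per-tool results into one answer; the sketch only shows the branch between the text-only routing path and the per-modality delegation path.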
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Topic Modeling
