Adaptive Vision-Language Model Routing for Computer Use Agents
Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

TL;DR
This paper introduces Adaptive VLM Routing (AVR), a cost-effective framework for dynamically selecting vision-language models for computer use agents based on action difficulty, improving efficiency while maintaining accuracy.
Contribution
The paper proposes AVR, a semantic routing framework that sends each GUI action prediction to one of several VLMs based on the action's estimated difficulty, reducing inference costs by up to 78%.
Findings
AVR reduces inference costs by up to 78%.
AVR maintains accuracy within 2% of using only large models.
AVR effectively handles high-risk actions with safety measures.
Abstract
Computer Use Agents (CUAs) translate natural-language instructions into Graphical User Interface (GUI) actions such as clicks, keystrokes, and scrolls by relying on a Vision-Language Model (VLM) to interpret screenshots and predict grounded tool calls. However, grounding accuracy varies dramatically across VLMs, while current CUA systems typically route every action to a single fixed model regardless of difficulty. We propose Adaptive VLM Routing (AVR), a framework that inserts a lightweight semantic routing layer between the CUA orchestrator and a pool of VLMs. For each tool call, AVR estimates action difficulty from multimodal embeddings, probes a small VLM to measure confidence, and routes the action to the cheapest model whose predicted accuracy satisfies a target reliability threshold. For "warm" agents with memory of prior UI interactions, retrieved context further…
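The routing rule described in the abstract (choose the cheapest model whose predicted accuracy meets a reliability threshold) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `ModelOption` class, the `route` function, the model names, the costs, the toy accuracy estimators, and the fallback used when no model meets the threshold are all assumptions made for illustration. The difficulty score and probe confidence are taken as given inputs in [0, 1]; how AVR actually computes them is not shown here.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ModelOption:
    """A candidate VLM in the pool (all fields are hypothetical)."""
    name: str
    cost: float  # relative cost per call, arbitrary units
    # Maps (difficulty, probe_confidence) to an estimated accuracy in [0, 1].
    predict_accuracy: Callable[[float, float], float]


def route(models: Sequence[ModelOption], difficulty: float,
          probe_confidence: float, threshold: float) -> ModelOption:
    """Return the cheapest model whose estimated accuracy meets the threshold.

    Assumed fallback: if no model qualifies, return the model with the
    highest estimated accuracy.
    """
    for model in sorted(models, key=lambda m: m.cost):
        if model.predict_accuracy(difficulty, probe_confidence) >= threshold:
            return model
    return max(models,
               key=lambda m: m.predict_accuracy(difficulty, probe_confidence))


def small_estimate(difficulty: float, probe_confidence: float) -> float:
    # Toy estimator: trust the small model's own confidence, discounted
    # as the action gets harder.
    return probe_confidence * (1.0 - 0.5 * difficulty)


def large_estimate(difficulty: float, probe_confidence: float) -> float:
    # Toy estimator: the large model degrades only mildly with difficulty.
    return 0.95 - 0.1 * difficulty


POOL = [
    ModelOption("small-vlm", cost=1.0, predict_accuracy=small_estimate),
    ModelOption("large-vlm", cost=10.0, predict_accuracy=large_estimate),
]

easy = route(POOL, difficulty=0.1, probe_confidence=0.98, threshold=0.90)
hard = route(POOL, difficulty=0.8, probe_confidence=0.60, threshold=0.85)
print(easy.name, hard.name)
```

With these toy estimators, the easy action stays on the small model and the hard action escalates to the large one. Sorting by cost before checking the threshold is what makes the selection "cheapest first"; a real system would replace the two estimator functions with calibrated predictors.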
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Graph Neural Networks
