Act, Think or Abstain: Complexity-Aware Adaptive Inference for Vision-Language-Action Models

Riccardo Andrea Izzo; Gianluca Bardaro; Matteo Matteucci

arXiv:2603.05147 · cs.CV · March 6, 2026

TL;DR

This paper introduces a complexity-aware adaptive inference framework for vision-language-action models that dynamically routes processing based on perceived task difficulty, improving efficiency and robustness.

Contribution

The paper proposes an adaptive routing mechanism, inspired by human cognition, that turns the VLA's vision-language backbone into an active detection tool which decides when to act, think, or abstain.

Findings

01. Vision-only embeddings effectively infer task complexity.

02. Achieves 80% F1-Score with only 5% of training data.

03. Enhances efficiency and robustness in real-world robotic tasks.

Abstract

Current research on Vision-Language-Action (VLA) models predominantly focuses on enhancing generalization through established reasoning techniques. While effective, these improvements invariably increase computational complexity and inference latency. Furthermore, these mechanisms are typically applied indiscriminately, resulting in the inefficient allocation of resources for trivial tasks while simultaneously failing to provide the uncertainty estimation necessary to prevent catastrophic failure on out-of-distribution tasks. Inspired by human cognition, we propose an adaptive framework that dynamically routes VLA execution based on the complexity of the perceived state. Our approach transforms the VLA's vision-language backbone into an active detection tool by projecting latent embeddings into an ensemble of parametric and non-parametric estimators. This allows the system to execute…
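The routing idea in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the abstract does not name the estimators, so a diagonal Gaussian stands in for the parametric estimator and a k-nearest-neighbour distance for the non-parametric one. The function names, the equal weighting, the thresholds, and the mapping of low, medium, and high scores to act, think, and abstain are all hypothetical and are not taken from the paper.

```python
import math

# Hypothetical sketch, not the authors' implementation.
# Embeddings are plain tuples of floats standing in for backbone latents.

def fit_gaussian(embeddings):
    """Parametric stand-in: per-dimension mean and variance of seen embeddings."""
    n, d = len(embeddings), len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / n for i in range(d)]
    # Small constant keeps the variance strictly positive.
    var = [sum((e[i] - mean[i]) ** 2 for e in embeddings) / n + 1e-6
           for i in range(d)]
    return mean, var

def gaussian_score(x, mean, var):
    """Variance-normalised distance of x from the fitted mean."""
    return math.sqrt(sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var)))

def knn_score(x, embeddings, k=3):
    """Non-parametric stand-in: mean Euclidean distance to the k nearest embeddings."""
    dists = sorted(math.dist(x, e) for e in embeddings)
    return sum(dists[:k]) / k

def route(x, embeddings, model, think_at=1.5, abstain_at=5.0):
    """Average both scores and threshold them (thresholds are assumed values)."""
    mean, var = model
    score = 0.5 * gaussian_score(x, mean, var) + 0.5 * knn_score(x, embeddings)
    if score < think_at:
        return "act"      # familiar state: run the policy directly
    if score < abstain_at:
        return "think"    # harder state: spend extra reasoning
    return "abstain"      # far from anything seen: refuse to act

# Toy 2-D "embeddings" of states seen during training.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
model = fit_gaussian(train)

for query in [(0.5, 0.5), (2.0, 2.0), (10.0, 10.0)]:
    print(query, route(query, train, model))
```

In this toy setup the query inside the training cluster is routed to direct action, the nearby query to additional reasoning, and the distant query to abstention. Real embeddings would be high-dimensional, and the two scores would need calibration before being combined.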

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Robot Manipulation and Learning