Act, Think or Abstain: Complexity-Aware Adaptive Inference for Vision-Language-Action Models
Riccardo Andrea Izzo, Gianluca Bardaro, Matteo Matteucci

TL;DR
This paper introduces a complexity-aware adaptive inference framework for vision-language-action models that dynamically routes processing based on perceived task difficulty, improving efficiency and robustness.
Contribution
The paper proposes an adaptive routing mechanism, inspired by human cognition, that turns the VLA's vision-language backbone into an active detection tool for deciding when to act, think, or abstain.
Findings
Vision-only embeddings effectively infer task complexity.
Reaches an 80% F1 score using only 5% of the training data.
Enhances efficiency and robustness in real-world robotic tasks.
Abstract
Current research on Vision-Language-Action (VLA) models predominantly focuses on enhancing generalization through established reasoning techniques. While effective, these improvements invariably increase computational complexity and inference latency. Furthermore, these mechanisms are typically applied indiscriminately, resulting in the inefficient allocation of resources for trivial tasks while simultaneously failing to provide the uncertainty estimation necessary to prevent catastrophic failure on out-of-distribution tasks. Inspired by human cognition, we propose an adaptive framework that dynamically routes VLA execution based on the complexity of the perceived state. Our approach transforms the VLA's vision-language backbone into an active detection tool by projecting latent embeddings into an ensemble of parametric and non-parametric estimators. This allows the system to execute…
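The routing idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `ComplexityRouter`, the choice of a single Gaussian (Mahalanobis distance) as the parametric estimator, a k-nearest-neighbour distance as the non-parametric one, the averaging rule, and the thresholds `think_at` and `abstain_at` are all assumptions made for illustration. The embeddings are assumed to be plain vectors already extracted from the backbone.

```python
import numpy as np


class ComplexityRouter:
    """Toy act/think/abstain router over embedding vectors (illustrative only)."""

    def __init__(self, train, k=5, think_at=1.0, abstain_at=2.0):
        self.train = np.asarray(train, dtype=float)
        self.k = k
        self.think_at = think_at      # hypothetical threshold, not from the paper
        self.abstain_at = abstain_at  # hypothetical threshold, not from the paper
        # Parametric estimator (assumed): one Gaussian fitted to the embeddings.
        self.mean = self.train.mean(axis=0)
        cov = np.cov(self.train, rowvar=False)
        cov += 1e-3 * np.eye(self.train.shape[1])  # ridge term keeps cov invertible
        self.cov_inv = np.linalg.inv(cov)
        # Normalise each score by its 95th percentile on the training set so the
        # two estimators are comparable. Training points count themselves as a
        # neighbour here, which slightly shrinks the k-NN scale.
        self.maha_scale = np.percentile([self._maha(x) for x in self.train], 95)
        self.knn_scale = np.percentile([self._knn(x) for x in self.train], 95)

    def _maha(self, x):
        # Mahalanobis distance to the fitted Gaussian.
        d = x - self.mean
        return float(np.sqrt(d @ self.cov_inv @ d))

    def _knn(self, x):
        # Non-parametric estimator (assumed): mean distance to the k nearest
        # training embeddings.
        dists = np.linalg.norm(self.train - x, axis=1)
        return float(np.sort(dists)[: self.k].mean())

    def score(self, x):
        # Ensemble by simple averaging of the two normalised scores.
        x = np.asarray(x, dtype=float)
        return 0.5 * (self._maha(x) / self.maha_scale + self._knn(x) / self.knn_scale)

    def route(self, x):
        s = self.score(x)
        if s < self.think_at:
            return "act"      # familiar state: execute directly
        if s < self.abstain_at:
            return "think"    # ambiguous state: spend extra reasoning
        return "abstain"      # far from training data: refuse to execute
```

In this sketch a low score means the state resembles the training data, so the cheap path is taken; a high score is treated as out-of-distribution. A real system would calibrate the thresholds on held-out data rather than fixing them by hand.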
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Robot Manipulation and Learning
