TL;DR
NanoVLA introduces a lightweight, decoupled vision-language-action architecture that significantly reduces inference latency and resource usage on edge devices, enabling efficient robotic manipulation without sacrificing accuracy.
Contribution
The paper presents a novel decoupled architecture with dynamic routing and action chunking, optimizing vision-language models for resource-constrained robotic applications.
Findings
Achieves up to 52x faster inference on edge devices.
Uses 98% fewer parameters while maintaining accuracy.
Demonstrates effective real-world robotic manipulation.
Abstract
Vision-language-action (VLA) models have significantly advanced robotic manipulation by integrating vision-language models (VLMs) and action decoders into a unified architecture. However, their deployment on resource-constrained edge devices, such as mobile robots or embedded systems (e.g., Jetson Orin Nano), remains challenging due to high computational demands, especially in real-world scenarios where power, latency, and computational resources are critical. To close this gap, we introduce Nano-scale Vision-Language Action (NanoVLA), a family of lightweight VLA architectures that achieve high performance with minimal resources. Our core innovations include: (1) vision-language decoupling, which moves the conventional early fusion of vision and language inputs in the VLM to a late stage, achieving better performance while enabling caching and reducing inference overhead and latency; (2) long-short…
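A minimal sketch of why late fusion makes caching possible, under stated assumptions: the abstract does not say which features are cached, so this toy assumes it is the instruction embedding, which stays fixed while camera frames change. `encode_text`, `encode_vision`, `late_fuse`, and `control_step` are hypothetical stand-ins, not NanoVLA's components.

```python
from functools import lru_cache

# Counters that show how often each (hypothetical) encoder actually runs.
ENCODE_CALLS = {"text": 0, "vision": 0}


@lru_cache(maxsize=32)
def encode_text(instruction: str) -> tuple:
    """Toy language encoder (assumption): one number per word, cached per instruction."""
    ENCODE_CALLS["text"] += 1
    return tuple(float(len(word)) for word in instruction.split())


def encode_vision(frame: list) -> tuple:
    """Toy vision encoder (assumption): runs on every new camera frame."""
    ENCODE_CALLS["vision"] += 1
    return tuple(float(pixel) for pixel in frame)


def late_fuse(text_feat: tuple, vision_feat: tuple) -> tuple:
    """Toy late fusion (assumption): concatenate the independently computed features."""
    return text_feat + vision_feat


def control_step(instruction: str, frame: list) -> tuple:
    # The modalities are encoded separately, so the text branch can be
    # served from the cache while only the vision branch is recomputed.
    return late_fuse(encode_text(instruction), encode_vision(frame))


frames = [[i, i + 1, i + 2, i + 3] for i in range(5)]
fused = [control_step("pick up the banana", frame) for frame in frames]
print(ENCODE_CALLS)
```

With early fusion the text tokens would pass through the model together with each frame, so there would be no frame-independent text feature to reuse in this way.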
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The paper is well-structured with clear motivation. It demonstrates on-edge VLA control with improved success and major latency gains, addressing a potential deployment blocker for household and mobile manipulation. Reported numbers are compelling: SOTA-competitive LIBERO performance with far fewer parameters, strong LeRobot success rates, and notably higher FPS on Orin Nano.
**LSAC**: While LSAC is effective, the idea of predicting longer sequences and executing shorter sub-segments with periodic replans has been used in Diffusion Policy and Pi0; the paper could better articulate what is novel.
**Routing signal is text-only.** The router is trained as a text-conditioned comparator over models. This assumes that instruction phrasing correlates tightly with task difficulty. However, “pick up the banana” may range from trivial on a clear table to hard in a fruit pile
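The generic pattern named in the LSAC comment of this review (predict a long action sequence, execute only a short prefix, then replan) can be sketched as follows. This is a toy receding-horizon loop on a 1-D position, not NanoVLA's LSAC; `predict_chunk`, `run_episode`, and the `horizon` and `execute` values are illustrative assumptions.

```python
def predict_chunk(position: float, goal: float, horizon: int) -> list:
    """Hypothetical policy: plan `horizon` unit-bounded moves toward the goal."""
    actions = []
    simulated = position
    for _ in range(horizon):
        step = max(-1.0, min(1.0, goal - simulated))
        actions.append(step)
        simulated += step
    return actions


def run_episode(start: float, goal: float, horizon: int = 8,
                execute: int = 3, max_steps: int = 50) -> tuple:
    """Predict `horizon` actions, run only the first `execute`, then replan."""
    position = start
    env_steps = 0
    policy_calls = 0
    while abs(goal - position) > 1e-9 and env_steps < max_steps:
        chunk = predict_chunk(position, goal, horizon)
        policy_calls += 1
        for action in chunk[:execute]:  # the rest of the plan is discarded
            position += action
            env_steps += 1
            if abs(goal - position) <= 1e-9:
                break
    return position, env_steps, policy_calls


final_position, env_steps, policy_calls = run_episode(0.0, 10.0)
print(final_position, env_steps, policy_calls)  # 10.0 10 4
```

In this toy, ten environment steps need only four policy calls. A larger `execute` lowers the number of calls further but reacts to new observations less often.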
* The problem of deploying VLA policies on edge devices is well motivated and addresses a real practical constraint for many robotics practitioners who do not have access to server-grade hardware for model deployment.
* The experiments include both simulated and real-world robot tasks, validating the effectiveness of NanoVLA across various tasks and domains.
* Simulated evaluations in LIBERO show superior performance of NanoVLA compared to various prior methods, including OpenVLA, $\pi_0$, SmolV
* LIBERO experimental results do not include several state-of-the-art prior works from early 2025, including OpenVLA-OFT (97.1% success rate - RSS 2025) and UniVLA (95.2% success rate - RSS 2025). These works use earlier fusion of language and vision representations and obtain substantially higher performance in LIBERO than the proposed NanoVLA (84.1% success rate).
* The authors argue that late fusion of the modalities is a superior approach, but analysis of an early fusion alternative with the
- Employs two strategies to reduce the long inference latency of VLAs, saving 62% of inference time compared to the traditional VLA approach.
- Detailed latency and performance analysis.
- The model structure is overly simplistic and similar to existing work. Both Diffusion Policy[1] and Scaling-Up Diffusion Policy[2] use a lightweight transformer for late integration of the different modalities, trained end to end. NanoVLA and these methods do achieve very low latency, but this design loses the language-vision alignment capability that VLMs have acquired through extensive training, which greatly affects the generalization ability of the VLA.
- The experiments on real world are ove
This paper tackles an important issue in robotics: runtime efficiency. The method contains many careful design choices. The paper shows good performance and speed on LIBERO given the number of parameters tuned.
More discussion on the advantages of the late fusion technique is needed. Comparison with other baselines on improving VLA efficiency is needed. See the questions below for other minor points.
