SaiVLA-0: Cerebrum--Pons--Cerebellum Tripartite Architecture for Compute-Aware Vision-Language-Action
Xiang Shi, Wenlong Huang, Menglin Zou, Xinhai Sun

TL;DR
SaiVLA-0 introduces a neuroscience-inspired tripartite architecture for vision-language-action tasks, combining stable high-level priors, real-time integration, and fast online control, leading to improved efficiency and success rates.
Contribution
The paper presents a modular, biologically inspired architecture that separates high-level, integration, and control functions, enabling targeted training and enhanced performance in vision-language-action systems.
Findings
Split feature caching reduces training time from 7.5 h to 4.5 h (a 40% reduction).
Achieves an average success rate of 92.5% under official training conditions.
Reaches 99.0% mean success in preliminary tests.
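The caching result above can be illustrated with a minimal sketch. It assumes, based only on the abstract's statement that the Cerebrum stays frozen, that its output for a given input never changes during training and can therefore be computed once and reused in later epochs. The names `FeatureCache` and `frozen_encoder` are hypothetical and are not taken from the paper; the paper's actual split or two-stage caching scheme may differ.

```python
import numpy as np


def frozen_encoder(image):
    """Hypothetical stand-in for a frozen backbone: mean-pool over height and width."""
    return image.mean(axis=(0, 1))


class FeatureCache:
    """Memoize the outputs of a frozen encoder, keyed by sample id (illustrative only)."""

    def __init__(self, encoder):
        self.encoder = encoder
        self.store = {}
        self.encoder_calls = 0  # counts how often the expensive encoder actually runs

    def get(self, key, image):
        # Run the encoder only on the first request for this sample.
        if key not in self.store:
            self.store[key] = self.encoder(image)
            self.encoder_calls += 1
        return self.store[key]


# Two passes over four samples: the encoder runs four times, not eight.
rng = np.random.default_rng(0)
images = {i: rng.random((8, 8, 3)) for i in range(4)}
cache = FeatureCache(frozen_encoder)
for epoch in range(2):
    for key, image in images.items():
        features = cache.get(key, image)
```

In this sketch the second epoch is served entirely from the cache, which is the general mechanism by which caching frozen features can shorten training.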
Abstract
We revisit Vision-Language-Action through a neuroscience-inspired triad. Biologically, the Cerebrum provides stable high-level multimodal priors and remains frozen; the Pons Adapter integrates these cortical features with real-time proprioceptive inputs and compiles intent into execution-ready tokens; and the Cerebellum (ParaCAT) performs fast, parallel categorical decoding for online control, with hysteresis/EMA/temperature/entropy for stability. A fixed-ratio schedule and two-stage feature caching make the system compute-aware and reproducible. Inspired by active, foveated vision, our wrist ROIs are geometrically tied to the end-effector via calibrated projection, providing a movement-stabilized, high-resolution view that is sensitive to fine-grained pose changes and complements the global context of the main view. The design is modular: upgrading the Cerebrum only retrains the…
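As a rough illustration of how hysteresis, EMA, temperature, and entropy could combine to stabilize categorical decoding, the sketch below decodes each action dimension independently from its own set of bin logits. It is an assumption-laden example, not the ParaCAT implementation: the class name `StabilizedDecoder`, the parameters `alpha`, `temperature`, and `margin`, and the specific switching rule are all hypothetical.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


class StabilizedDecoder:
    """Per-dimension categorical decoding with EMA smoothing and hysteresis (illustrative)."""

    def __init__(self, num_dims, num_bins, alpha=0.5, temperature=1.0, margin=0.1):
        self.alpha = alpha              # EMA weight given to the newest distribution
        self.temperature = temperature  # values above 1 flatten, below 1 sharpen
        self.margin = margin            # probability gain required to switch bins
        self.ema = np.full((num_dims, num_bins), 1.0 / num_bins)
        self.current = None

    def step(self, logits):
        probs = softmax(logits, self.temperature)
        self.ema = self.alpha * probs + (1.0 - self.alpha) * self.ema
        candidate = self.ema.argmax(axis=-1)
        if self.current is None:
            self.current = candidate
        else:
            rows = np.arange(len(candidate))
            gain = self.ema[rows, candidate] - self.ema[rows, self.current]
            # Hysteresis: keep the previous bin unless the new one wins by a clear margin.
            self.current = np.where(gain > self.margin, candidate, self.current)
        # Entropy of the smoothed distribution, usable as an uncertainty signal.
        entropy = -(self.ema * np.log(self.ema + 1e-12)).sum(axis=-1)
        return self.current.copy(), entropy
```

All action dimensions are decoded in one vectorized step, loosely mirroring the "parallel" decoding described in the abstract. How the paper actually uses entropy (for example as a gate or a regularizer) is not stated in the text available here, so the sketch only reports it.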
Taxonomy
Topics: Advanced Memory and Neural Computing · EEG and Brain-Computer Interfaces · Multimodal Machine Learning Applications
