Sigma: The Key for Vision-Language-Action Models toward Telepathic Alignment
Libo Wang

TL;DR
This paper introduces Sigma, a vision-language-action model that achieves telepathic alignment between perception and action through semantic understanding and associative reasoning, without retraining the base model.
Contribution
The work presents a novel VLA architecture and training methodology enabling semantic alignment and intention-driven control in vision-language-action models.
Findings
Sigma reduces control MSE across multiple scales
Maintains stability of the telepathy norm and semantic-text alignment
Demonstrates reproducible semantic alignment without retraining the base model
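The control MSE named in the findings can be illustrated with a minimal sketch. It assumes that "control MSE" means the mean squared error between predicted and recorded action vectors during offline replay, and that "multiple scales" refers to evaluation horizons of different lengths; neither definition is confirmed by this page. The function names `control_mse` and `mse_by_horizon` are hypothetical and are not taken from the paper.

```python
from typing import Dict, Sequence

Trajectory = Sequence[Sequence[float]]  # one action vector per timestep


def control_mse(predicted: Trajectory, recorded: Trajectory) -> float:
    """Mean squared error over all timesteps and action dimensions."""
    if len(predicted) != len(recorded):
        raise ValueError("trajectories must have the same number of timesteps")
    total = 0.0
    count = 0
    for p_step, r_step in zip(predicted, recorded):
        if len(p_step) != len(r_step):
            raise ValueError("action vectors must have the same dimension")
        for p, r in zip(p_step, r_step):
            total += (p - r) ** 2
            count += 1
    if count == 0:
        raise ValueError("trajectories must not be empty")
    return total / count


def mse_by_horizon(
    predicted: Trajectory, recorded: Trajectory, horizons: Sequence[int]
) -> Dict[int, float]:
    """MSE over the first h timesteps for each horizon h.

    Assumption: 'scale' is interpreted here as a horizon length.
    """
    return {h: control_mse(predicted[:h], recorded[:h]) for h in horizons}
```

Under these assumptions, a comparison between a tuned and an untuned policy would call `mse_by_horizon` once per policy on the same recorded trajectory and compare the two dictionaries horizon by horizon.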
Abstract
To address a fundamental limitation in cognitive systems, namely the absence of a time-updatable mediating thought space between semantics and continuous control, this work constructs and trains a vision-language-action model termed Sigma, deployed on a single RTX 4090. The model is built upon the open-source pi0.5_base backbone, with the svla_so101_pickplace dataset preprocessed into a structured training corpus. An independently designed VLA architecture is introduced to integrate deep semantic understanding with associative reasoning, enabling telepathic-style alignment between perception and action. Training proceeds through iterative optimization of data preprocessing, LoRA-based fine-tuning, and inference-stage adapter design. Evaluation is conducted using offline closed-loop replay, comparing Sigma against the untuned pi0.5_base under identical data conditions. Experimental…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Ferroelectric and Negative Capacitance Devices
