SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation

Wei Li; Renshan Zhang; Rui Shao; Zhijian Fang; Kaiwen Zhou; Zhuotao Tian; Liqiang Nie

arXiv:2511.10518 · cs.CV · November 14, 2025

TL;DR

SemanticVLA is a Vision-Language-Action framework for robotic manipulation that prunes redundant visual input under instruction guidance, keeps the remaining features semantically aligned with the task, and couples perception more tightly with action to improve both efficiency and performance.

Contribution

The paper proposes SemanticVLA, a VLA framework with semantic-aligned sparsification and enhanced perception-action integration, and reports state-of-the-art results on robotic manipulation tasks.

Findings

01

Surpasses OpenVLA by 21.1% in success rate on the LIBERO benchmark.

02

Reduces training cost 3.0-fold and inference latency 2.7-fold.

03

Sets a new state of the art in both efficiency and performance for robotic manipulation.
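
To read the fold reductions in finding 02 as fractions of a baseline, the short sketch below does the conversion. It assumes that an N-fold (or "N times") reduction means the new value equals the baseline divided by N; the helper name `remaining_fraction` is hypothetical and does not come from the paper.

```python
def remaining_fraction(fold_reduction: float) -> float:
    """Fraction of the baseline that remains after an N-fold reduction.

    Assumption: "reduced N-fold" means new_value = baseline / N.
    """
    if fold_reduction <= 0:
        raise ValueError("fold_reduction must be positive")
    return 1.0 / fold_reduction


# Figures quoted in the findings: 3.0-fold training cost, 2.7-fold latency.
training_cost_left = remaining_fraction(3.0)
latency_left = remaining_fraction(2.7)

print(f"training cost: {training_cost_left:.1%} of baseline")
print(f"inference latency: {latency_left:.1%} of baseline")
```

Under that reading, both quantities drop to roughly a third of their baseline values.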

Abstract

Vision-Language-Action (VLA) models have advanced in robotic manipulation, yet practical deployment remains hindered by two key limitations: 1) perceptual redundancy, where irrelevant visual inputs are processed inefficiently, and 2) superficial instruction-vision alignment, which hampers semantic grounding of actions. In this paper, we propose SemanticVLA, a novel VLA framework that performs Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation. Specifically: 1) To sparsify redundant perception while preserving semantic alignment, the Semantic-guided Dual Visual Pruner (SD-Pruner) combines two components: the Instruction-driven Pruner (ID-Pruner), which extracts global action cues and local semantic anchors in SigLIP, and the Spatial-aggregation Pruner (SA-Pruner), which compacts geometry-rich features into task-adaptive tokens in DINOv2. 2) To exploit sparsified features and integrate semantics with…
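
The general idea of instruction-guided visual sparsification can be illustrated with a minimal sketch. This is not the SD-Pruner, ID-Pruner, or SA-Pruner described in the abstract, whose details are not given here; it is a generic stand-in that scores each visual token by cosine similarity to a pooled instruction embedding and keeps the top-k. The function name `prune_tokens_by_instruction`, the array shapes, and the random features are all assumptions made for illustration.

```python
import numpy as np


def prune_tokens_by_instruction(visual_tokens, instruction_embedding, keep):
    """Keep the `keep` visual tokens most similar to the instruction.

    Hypothetical helper: a generic cosine-similarity top-k filter,
    not the pruning method from the paper.

    visual_tokens: (num_tokens, dim) array of patch features.
    instruction_embedding: (dim,) pooled text feature.
    Returns (kept_tokens, kept_indices), indices in original order.
    """
    tokens = np.asarray(visual_tokens, dtype=np.float64)
    query = np.asarray(instruction_embedding, dtype=np.float64)
    if tokens.ndim != 2 or query.shape != (tokens.shape[1],):
        raise ValueError("expected tokens (N, D) and instruction (D,)")
    if not 1 <= keep <= tokens.shape[0]:
        raise ValueError("keep must be between 1 and num_tokens")

    # Cosine similarity between every token and the instruction.
    eps = 1e-12  # guards against division by zero for all-zero vectors
    token_norms = np.linalg.norm(tokens, axis=1) + eps
    query_norm = np.linalg.norm(query) + eps
    scores = (tokens @ query) / (token_norms * query_norm)

    # Select the top-k scores, then restore the original token order
    # so downstream layers still see tokens in spatial sequence.
    top = np.argsort(-scores, kind="stable")[:keep]
    kept_indices = np.sort(top)
    return tokens[kept_indices], kept_indices


# Toy usage with random features standing in for encoder outputs.
rng = np.random.default_rng(0)
visual_tokens = rng.normal(size=(16, 8))
instruction_embedding = rng.normal(size=8)
kept, idx = prune_tokens_by_instruction(visual_tokens, instruction_embedding, keep=4)
print(kept.shape, idx)
```

Keeping the surviving indices in their original order is a design choice in this sketch: it preserves the spatial sequence of patches, which matters if later layers rely on positional information.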

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics