Vision-Language Models on the Edge for Real-Time Robotic Perception

Sarat Ahmad; Maryam Hafeez; Syed Ali Raza Zaidi

arXiv:2601.14921·cs.RO·January 22, 2026

Open Access

TL;DR

This paper explores deploying vision-language models on edge infrastructure like 6G Open RAN and MEC to enable real-time robotic perception, balancing latency, resource constraints, and accuracy.

Contribution

It demonstrates the deployment of VLMs on edge infrastructure using a WebRTC pipeline and evaluates models like LLaMA-3.2-11B-Vision-Instruct and Qwen2-VL-2B-Instruct in real-time robotic scenarios.

Findings

01. Edge deployment reduces end-to-end latency by 5% compared to cloud deployment while preserving near-cloud accuracy.

02. Compact models achieve sub-second response times.

03. A trade-off is observed between latency reduction and accuracy.

Abstract

Vision-Language Models (VLMs) enable multimodal reasoning for robotic perception and interaction, but their deployment in real-world systems remains constrained by latency, limited onboard resources, and privacy risks of cloud offloading. Edge intelligence within 6G, particularly Open RAN and Multi-access Edge Computing (MEC), offers a pathway to address these challenges by bringing computation closer to the data source. This work investigates the deployment of VLMs on ORAN/MEC infrastructure using the Unitree G1 humanoid robot as an embodied testbed. We design a WebRTC-based pipeline that streams multimodal data to an edge node and evaluate LLaMA-3.2-11B-Vision-Instruct deployed at the edge versus in the cloud under real-time conditions. Our results show that edge deployment preserves near-cloud accuracy while reducing end-to-end latency by 5%. We further evaluate…
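
To make the reported latency figure concrete, the sketch below shows one common way a relative end-to-end latency reduction could be computed from per-request timings. It is only an illustration: the function name `latency_reduction_pct` and the sample values are hypothetical and are not taken from the paper, which may aggregate its measurements differently.

```python
from statistics import mean


def latency_reduction_pct(cloud_ms, edge_ms):
    """Percent reduction in mean end-to-end latency of edge relative to cloud."""
    cloud_avg = mean(cloud_ms)
    edge_avg = mean(edge_ms)
    if cloud_avg <= 0:
        raise ValueError("mean cloud latency must be positive")
    return 100.0 * (cloud_avg - edge_avg) / cloud_avg


# Hypothetical per-request latencies in milliseconds (NOT data from the paper).
cloud_samples = [2000.0, 2100.0, 1900.0]
edge_samples = [1900.0, 1995.0, 1805.0]

reduction = latency_reduction_pct(cloud_samples, edge_samples)
print(f"Edge reduces mean latency by {reduction:.1f}%")
```

With these made-up samples the mean drops from 2000 ms to 1900 ms, so the computed reduction is 5%. A figure of that size leaves the absolute latency largely unchanged, which is why the model-size results matter alongside the deployment location.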

Taxonomy

Topics: IoT and Edge/Fog Computing · Advanced Neural Network Applications · Multimodal Machine Learning Applications