A Vision-Language-Action Model for Adaptive Ultrasound-Guided Needle Insertion and Needle Tracking

Yuelin Zhang; Qingpeng Ding; Longxiang Tang; Chengyu Fang; Shing Shin Cheng

arXiv:2604.20347 · cs.RO · April 23, 2026

TL;DR

This paper introduces a unified Vision-Language-Action model for real-time, adaptive ultrasound-guided needle insertion and tracking, improving safety and efficiency in robotic ultrasound procedures.

Contribution

The paper presents a novel integrated framework combining needle tracking and insertion control with real-time adaptive capabilities for robotic ultrasound systems.

Findings

01 Outperforms state-of-the-art trackers in accuracy

02 Achieves higher insertion success rates

03 Reduces procedure time significantly

Abstract

Ultrasound (US)-guided needle insertion is a critical yet challenging procedure due to dynamic imaging conditions and difficulties in needle visualization. Many methods have been proposed for automated needle insertion, but they often rely on hand-crafted pipelines with modular controllers, whose performance degrades in challenging cases. In this paper, a Vision-Language-Action (VLA) model is proposed for adaptive and automated US-guided needle insertion and tracking on a robotic ultrasound (RUS) system. This framework provides a unified approach to needle tracking and needle insertion control, enabling real-time, dynamically adaptive adjustment of insertion based on the obtained needle position and environment awareness. To achieve real-time and end-to-end tracking, a Cross-Depth Fusion (CDF) tracking head is proposed, integrating shallow positional and deep semantic features from the…
