A Vision-Language-Action Model for Adaptive Ultrasound-Guided Needle Insertion and Needle Tracking
Yuelin Zhang, Qingpeng Ding, Longxiang Tang, Chengyu Fang, Shing Shin Cheng

TL;DR
This paper introduces a unified Vision-Language-Action model for real-time, adaptive ultrasound-guided needle insertion and tracking, improving safety and efficiency in robotic ultrasound procedures.
Contribution
The paper presents an integrated framework that unifies needle tracking and insertion control for robotic ultrasound systems, adjusting the insertion in real time based on the tracked needle position.
Findings
Outperforms state-of-the-art trackers in accuracy
Achieves higher insertion success rates
Reduces procedure time significantly
Abstract
Ultrasound (US)-guided needle insertion is a critical yet challenging procedure due to dynamic imaging conditions and difficulties in needle visualization. Many methods have been proposed for automated needle insertion, but they often rely on hand-crafted pipelines with modular controllers, whose performance degrades in challenging cases. In this paper, a Vision-Language-Action (VLA) model is proposed for adaptive and automated US-guided needle insertion and tracking on a robotic ultrasound (RUS) system. This framework provides a unified approach to needle tracking and needle insertion control, enabling real-time, dynamically adaptive adjustment of insertion based on the obtained needle position and environment awareness. To achieve real-time and end-to-end tracking, a Cross-Depth Fusion (CDF) tracking head is proposed, integrating shallow positional and deep semantic features from the…
