EndoVLA: Dual-Phase Vision-Language-Action Model for Autonomous Tracking in Endoscopy

Chi Kit Ng; Long Bai; Guankun Wang; Yupeng Wang; Huxin Gao; Kun Yuan; Chenhan Jin; Tieyong Zeng; Hongliang Ren

arXiv:2505.15206 · cs.RO · August 21, 2025 · 2 cites

Open Access

TL;DR

EndoVLA is a specialized vision-language-action model designed for autonomous endoscopic tracking, improving robustness and generalization in complex GI environments through dual-phase training.

Contribution

We introduce EndoVLA, a novel dual-phase VLA model tailored for robotic endoscopy, addressing data scarcity and domain shifts for improved autonomous tracking.

Findings

01 Enhanced tracking accuracy in endoscopy

02 Zero-shot generalization to diverse scenes

03 Effective handling of complex sequential tasks

Abstract

In endoscopic procedures, autonomous tracking of abnormal regions and following circumferential cutting markers can significantly reduce the cognitive burden on endoscopists. However, conventional model-based pipelines are fragile: each component (e.g., detection, motion planning) requires manual tuning and struggles to incorporate high-level endoscopic intent, leading to poor generalization across diverse scenes. Vision-Language-Action (VLA) models, which integrate visual perception, language grounding, and motion planning within an end-to-end framework, offer a promising alternative by semantically adapting to surgeon prompts without manual recalibration. Despite their potential, applying VLA models to robotic endoscopy presents unique challenges due to the complex and dynamic anatomical environments of the gastrointestinal (GI) tract. To address this, we introduce EndoVLA,…

Taxonomy

Topics: Colorectal Cancer Screening and Detection