CapsDT: Diffusion-Transformer for Capsule Robot Manipulation
Xiting He, Mingwu Su, Xinqi Jiang, Long Bai, Jiewen Lai, Hongliang Ren

TL;DR
CapsDT is a novel Diffusion Transformer model that integrates visual and textual data to improve capsule robot manipulation in endoscopy, demonstrating state-of-the-art performance in simulated and real-world tasks.
Contribution
This work introduces CapsDT, the first diffusion-transformer-based model for capsule robot control using vision and language inputs, enhancing endoscopy task performance.
Findings
Achieves state-of-the-art results in simulated endoscopy tasks.
Attains a 26.25% success rate in real-world capsule manipulation.
Demonstrates robustness across various endoscopy scenarios.
Abstract
Vision-Language-Action (VLA) models have emerged as a prominent research area, showcasing significant potential across a variety of applications. However, their performance in endoscopy robotics, particularly endoscopy capsule robots that perform actions within the digestive system, remains unexplored. The integration of VLA models into endoscopy robots allows more intuitive and efficient interactions between human operators and medical devices, improving both diagnostic accuracy and treatment outcomes. In this work, we design CapsDT, a Diffusion Transformer model for capsule robot manipulation in the stomach. By processing interleaved visual inputs and textual instructions, CapsDT can infer the corresponding robotic control signals to facilitate endoscopy tasks. In addition, we developed a capsule endoscopy robot system, a capsule robot controlled by a robotic arm-held magnet, addressing…
Taxonomy
Topics: Soft Robotics and Applications · Gastrointestinal Bleeding Diagnosis and Treatment · Colorectal Cancer Screening and Detection
Methods: Layer Normalization · Dropout · Absolute Position Encodings · Dense Connections · Byte Pair Encoding · Softmax · Label Smoothing · Transformer · Diffusion
