TL;DR
This paper adapts multimodal large language models (MLLMs) to autonomous skeletal landmark localization for C-arm control, reporting localization accuracy competitive with conventional deep learning methods and qualitative evidence that the models can reason about and correct their predictions.
Contribution
It fine-tunes MLLMs for skeletal landmark localization as a basis for autonomous C-arm positioning, with the ability to reason about and correct predictions.
Findings
MLLMs achieve localization accuracy competitive with conventional deep learning (DL) methods.
Qualitative results show that MLLMs can reason about and correct their initial predictions.
MLLMs can sequentially navigate the C-arm toward target landmarks.
Abstract
Purpose: Automated C-arm positioning ensures timely treatment in patients requiring emergent interventions. When a conventional Deep Learning (DL) approach for C-arm control fails, clinicians must revert to manual operation, resulting in additional delays. Consequently, an agentic C-arm control framework based on multimodal large language models (MLLMs) is highly desirable, as it can incorporate clinician feedback and use reasoning to make adjustments toward more accurate positioning. Skeletal landmark localization is essential for C-arm control, and we investigate adapting MLLMs for autonomous landmark localization. Methods: We used an annotated synthetic X-ray dataset and a real X-ray dataset. Each X-ray in both datasets is paired with several skeletal landmarks. We fine-tuned two MLLMs and tasked them with retrieving the closest landmarks from each X-ray. Quantitative evaluations…
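To make the "closest landmarks" retrieval task in the Methods more concrete, here is a minimal sketch of how such a target could be computed from annotations. It assumes that each landmark is stored as a 2D pixel coordinate and that "closest" means nearest to the image center; the abstract states neither, and the landmark names, coordinates, and the `closest_landmarks` helper are hypothetical.

```python
import math

def closest_landmarks(landmarks, reference, k=3):
    """Return the names of the k landmarks nearest to a reference point.

    landmarks: dict mapping landmark name -> (x, y) pixel coordinates.
    reference: (x, y) point to measure from (assumed here: the image center).
    """
    ranked = sorted(
        landmarks.items(),
        key=lambda item: math.dist(item[1], reference),  # Euclidean distance
    )
    return [name for name, _ in ranked[:k]]

# Hypothetical annotations for one 512x512 X-ray (not taken from the paper).
landmarks = {
    "femoral_head_left": (120.0, 340.0),
    "femoral_head_right": (390.0, 338.0),
    "pubic_symphysis": (256.0, 360.0),
    "l5_vertebra": (256.0, 120.0),
}
image_center = (256.0, 256.0)

nearest = closest_landmarks(landmarks, image_center, k=2)
print(nearest)
```

A ranked list like this could serve as the reference answer against which a fine-tuned model's text output is compared; the paper's actual target format and evaluation metric are not given in the excerpt above.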
