TL;DR
This paper introduces K2Sight, a framework that leverages structured semantic supervision and domain knowledge decomposition to improve abnormality grounding in medical images with smaller models and less data.
Contribution
K2Sight is a novel approach that decomposes clinical concepts into visual attributes and uses them as supervision, enabling data-efficient training of compact models for medical grounding tasks.
Findings
Achieves comparable or better performance than larger models.
Uses only 1.5% of data required by state-of-the-art models.
Improves $mAP_{50}$ by up to 9.82%.
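For context, $mAP_{50}$ counts a predicted box as a correct localization when its intersection over union (IoU) with the ground-truth box is at least 0.5. The sketch below shows only that matching criterion, not the full mAP computation; the function names `iou` and `is_hit_at_50` and the `(x1, y1, x2, y2)` box layout are assumptions made for illustration, not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region; clamp to zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) boxes.
    return inter / union if union > 0 else 0.0


def is_hit_at_50(pred_box, gt_box):
    """True when the prediction matches the ground truth at the IoU >= 0.5 threshold."""
    return iou(pred_box, gt_box) >= 0.5
```

Average precision at this threshold is then computed per class from the ranked list of hits and misses, and $mAP_{50}$ is the mean over classes.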
Abstract
In this work, we address the problem of grounding abnormalities in medical images, where the goal is to localize clinical findings based on textual descriptions. While generalist Vision-Language Models (VLMs) excel in natural grounding tasks, they often struggle in the medical domain due to rare, compositional, and domain-specific terms that are poorly aligned with visual patterns. Specialized medical VLMs address this challenge via large-scale domain pretraining, but at the cost of substantial annotation and computational resources. To overcome these limitations, we propose Knowledge to Sight (K2Sight), a framework that introduces structured semantic supervision by decomposing clinical concepts into interpretable visual attributes, such as shape, density, and anatomical location. These attributes are distilled from domain ontologies and encoded into concise instruction-style…
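The idea of decomposing a clinical concept into visual attributes and encoding them as an instruction-style prompt can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `VisualAttributes` class, the `build_prompt` function, the prompt template, and the attribute values are all hypothetical placeholders. Only the three attribute types (shape, density, anatomical location) come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisualAttributes:
    """Hypothetical container for the attribute types named in the abstract."""
    shape: str
    density: str
    location: str


def build_prompt(concept: str, attrs: VisualAttributes) -> str:
    """Compose a concise instruction-style prompt from a concept and its attributes.

    The template wording is an assumption; the paper's actual prompt format
    is not given in the excerpt above.
    """
    return (
        f"Locate {concept}: shape is {attrs.shape}; "
        f"density is {attrs.density}; location is {attrs.location}."
    )


# Placeholder values for illustration only; they are not drawn from any
# ontology and carry no clinical meaning.
example = VisualAttributes(shape="round", density="dense", location="lung")
prompt = build_prompt("nodule", example)
```

Keeping the attributes in a typed structure, rather than free text, makes it straightforward to generate one prompt per concept from an ontology-derived table.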
