Knowledge to Sight: Reasoning over Visual Attributes via Knowledge Decomposition for Abnormality Grounding

Jun Li; Che Liu; Wenjia Bai; Mingxuan Liu; Rossella Arcucci; Cosmin I. Bercea; Julia A. Schnabel

arXiv:2508.04572 · cs.CV · August 7, 2025

TL;DR

This paper introduces K2Sight, a framework that leverages structured semantic supervision and domain knowledge decomposition to improve abnormality grounding in medical images with smaller models and less data.

Contribution

K2Sight is a novel approach that decomposes clinical concepts into visual attributes and uses them as supervision, enabling data-efficient training of compact models for medical grounding tasks.

Findings

01. Achieves comparable or better performance than larger models.

02. Uses only 1.5% of the data required by state-of-the-art models.

03. Improves $mAP_{50}$ by up to 9.82%.
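For readers unfamiliar with the metric, $mAP_{50}$ conventionally counts a predicted box as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5. The sketch below shows only that matching criterion; it is not taken from the paper's evaluation code, and the function names `iou` and `is_hit_at_50` are hypothetical.

```python
# Minimal sketch of the IoU >= 0.5 criterion behind mAP_50.
# Assumption: boxes are (x_min, y_min, x_max, y_max) in pixel coordinates.
# The function names are illustrative, not from the paper or its code.

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


def is_hit_at_50(pred_box, gt_box):
    """True when the prediction counts as a correct detection at IoU 0.5."""
    return iou(pred_box, gt_box) >= 0.5
```

Average precision is then computed from the ranked list of hits and misses per class, and the mean over classes gives the reported score.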

Abstract

In this work, we address the problem of grounding abnormalities in medical images, where the goal is to localize clinical findings based on textual descriptions. While generalist Vision-Language Models (VLMs) excel in natural grounding tasks, they often struggle in the medical domain due to rare, compositional, and domain-specific terms that are poorly aligned with visual patterns. Specialized medical VLMs address this challenge via large-scale domain pretraining, but at the cost of substantial annotation and computational resources. To overcome these limitations, we propose Knowledge to Sight (K2Sight), a framework that introduces structured semantic supervision by decomposing clinical concepts into interpretable visual attributes, such as shape, density, and anatomical location. These attributes are distilled from domain ontologies and encoded into concise instruction-style…
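As a rough illustration of the idea in the abstract, the sketch below turns a clinical term plus its visual attributes (shape, density, anatomical location) into a short instruction-style string. The abstract is truncated here and does not show the authors' actual prompt format, so the `VisualAttributes` class, the `build_prompt` function, the template wording, and the example attribute values are all assumptions made up for illustration.

```python
# Hypothetical sketch: encode decomposed visual attributes as a prompt.
# Nothing here is the paper's actual template or ontology content.
from dataclasses import dataclass


@dataclass(frozen=True)
class VisualAttributes:
    """The three attribute types named in the abstract."""
    shape: str
    density: str
    location: str


def build_prompt(term: str, attrs: VisualAttributes) -> str:
    """Compose a concise instruction-style description (illustrative template)."""
    return (
        f"Locate {term}: shape {attrs.shape}; "
        f"density {attrs.density}; location {attrs.location}."
    )


# Placeholder values chosen only to show the data flow; they are not
# drawn from any domain ontology.
example = VisualAttributes(shape="round", density="dense", location="lung")
prompt = build_prompt("nodule", example)
```

Keeping the attributes in a typed record rather than free text makes it easy to swap templates or drop an attribute without touching the source of the attribute values.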

Code & Models

Models

🤗 RioJune/AG-KD (model · 23 downloads)
