WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering
Pingyi Chen, Chenglu Zhu, Sunyi Zheng, Honglin Li, Lin Yang

TL;DR
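This paper introduces WSI-VQA, a generative visual question answering framework for interpreting whole slide images (WSIs). It recasts diverse slide-level diagnostic tasks as question answering, contributes a new question-answer dataset, outperforms existing discriminative models in medical correctness, and explains its answers through visual co-attention.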
Contribution
The paper presents a novel generative VQA framework for WSIs, a new dataset with question-answer pairs, and demonstrates improved performance and interpretability over existing methods.
Findings
The generative W2T model outperforms discriminative models in medical correctness.
A new dataset of 8672 slide-level question-answer pairs across 977 WSIs is established.
Visual co-attention provides intuitive explanations for diagnostic results.
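To make the co-attention finding concrete, the following is a minimal sketch of how attention weights over WSI patches can be read as a relevance map. It is generic scaled dot-product attention in plain Python, not the authors' implementation; the function names, the single query vector, and the toy 2-dimensional embeddings are all assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def patch_attention(query, patches):
    """Scaled dot-product attention weights of one query vector over patch vectors.

    `query` stands in for a question (or answer-token) embedding and each entry
    of `patches` for a patch embedding; both are hypothetical toy inputs.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, patch)) / math.sqrt(d)
        for patch in patches
    ]
    return softmax(scores)


# Toy example: three patches, the first one aligned with the query.
query = [1.0, 0.0]
patches = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
weights = patch_attention(query, patches)

# The index of the highest weight marks the patch the query attends to most.
top = max(range(len(weights)), key=weights.__getitem__)
print(top, [round(w, 3) for w in weights])
```

Mapping each weight back to its patch coordinates on the slide would give a heatmap of the regions that contributed most to an answer, which is the kind of explanation the finding describes.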
Abstract
Whole slide imaging is routinely adopted for carcinoma diagnosis and prognosis. Abundant experience is required for pathologists to achieve accurate and reliable diagnostic results of whole slide images (WSI). The huge size and heterogeneous features of WSIs make the workflow of pathological reading extremely time-consuming. In this paper, we propose a novel framework (WSI-VQA) to interpret WSIs by generative visual question answering. WSI-VQA shows universality by reframing various kinds of slide-level tasks in a question-answering pattern, in which pathologists can achieve immunohistochemical grading, survival prediction, and tumor subtyping following human-machine interaction. Furthermore, we establish a WSI-VQA dataset which contains 8672 slide-level question-answering pairs with 977 WSIs. Besides the ability to deal with different slide-level tasks, our generative model which is…
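The reframing of slide-level tasks as question answering can be sketched as follows. This is an illustrative assumption about how labels might be turned into question-answer pairs, not the construction procedure of the WSI-VQA dataset; the question templates, field names, slide identifier, and label values are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    slide_id: str
    question: str
    answer: str


# Hypothetical question templates, one per slide-level task named in the abstract.
TEMPLATES = {
    "subtype": "What is the tumor subtype of this slide?",
    "ihc_grade": "What is the immunohistochemical grade of this slide?",
    "survival": "What is the predicted survival outcome for this patient?",
}


def labels_to_qa(slide_id, labels):
    """Turn a dict of slide-level labels into free-text question-answer pairs."""
    pairs = []
    for task, value in labels.items():
        if task not in TEMPLATES or value is None:
            continue  # skip unknown tasks and missing labels
        pairs.append(QAPair(slide_id, TEMPLATES[task], str(value)))
    return pairs


# Placeholder labels for one slide; the IHC grade is missing and is skipped.
example = labels_to_qa(
    "slide_0001",
    {"subtype": "ductal", "ihc_grade": None, "survival": "long"},
)
for pair in example:
    print(pair.question, "->", pair.answer)
```

Because every task shares the same question-in, text-out interface, a single generative model can serve all of them, which is the universality the abstract points to.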
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · Byte Pair Encoding · Layer Normalization · Label Smoothing · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam
