VisionLLM-based Multimodal Fusion Network for Glottic Carcinoma Early Detection
Zhaohui Jin, Yi Shuai, Yongcheng Li, Lingcong Cai, Yun Li, Huifen Liu, Xiaomao Fan

TL;DR
This paper introduces MMGC-Net, a multimodal fusion network utilizing VisionLLM to improve early detection of glottic carcinoma by integrating image and text data, achieving state-of-the-art accuracy on a new dataset.
Contribution
The paper presents a novel VisionLLM-based multimodal fusion network specifically designed for glottic carcinoma detection, leveraging a new dataset and advanced feature fusion techniques.
Findings
MMGC-Net outperforms previous models on the SYSU1H dataset.
Multimodal integration improves detection accuracy.
State-of-the-art results achieved with VisionLLM-based approach.
Abstract
The early detection of glottic carcinoma is critical for improving patient outcomes, as it enables timely intervention, preserves vocal function, and significantly reduces the risk of tumor progression and metastasis. However, the similarity in morphology between glottic carcinoma and vocal cord dysplasia results in suboptimal detection accuracy. To address this issue, we propose a vision large language model-based (VisionLLM-based) multimodal fusion network for glottic carcinoma detection, known as MMGC-Net. By integrating image and text modalities, multimodal models can capture complementary information, leading to more accurate and robust predictions. In this paper, we collect a private real glottic carcinoma dataset named SYSU1H from the First Affiliated Hospital of Sun Yat-sen University, with 5,799 image-text pairs. We leverage an image encoder and additional Q-Former to extract…
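The abstract is cut off before it describes how the image encoder and Q-Former outputs are combined with the text, so the following is only a generic illustration of the idea, not MMGC-Net itself. It assumes a Q-Former-style step in which a small set of query vectors cross-attends over image patch features, and then assumes (hypothetically) that the pooled result is concatenated with a text embedding. All names (`cross_attend`, `fuse`), dimensions, and random stand-in features are made up for the sketch; a real Q-Former also has learned projections, self-attention, and feed-forward layers that are omitted here.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, patches):
    """Single-head cross-attention without learned projections.

    Each query vector produces a softmax-weighted average of the
    image patch vectors (scaled dot-product attention).
    """
    dim = len(patches[0])
    attended = []
    for q in queries:
        weights = softmax([dot(q, p) / math.sqrt(dim) for p in patches])
        attended.append(
            [sum(w * p[i] for w, p in zip(weights, patches)) for i in range(dim)]
        )
    return attended


def fuse(query_outputs, text_vec):
    """Assumed fusion: mean-pool the query outputs, then concatenate the text vector."""
    dim = len(query_outputs[0])
    n = len(query_outputs)
    pooled = [sum(q[i] for q in query_outputs) / n for i in range(dim)]
    return pooled + list(text_vec)


random.seed(0)
dim = 4
# Random stand-ins for real encoder outputs (hypothetical shapes).
patches = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(6)]  # image patch features
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(2)]  # query vectors
text_vec = [random.gauss(0, 1) for _ in range(dim)]                     # text embedding

attended = cross_attend(queries, patches)
fused = fuse(attended, text_vec)
print(len(fused))  # 8: pooled image part (4) + text part (4)
```

In this sketch the fused vector would be passed to a classifier head; concatenation is used only because it is the simplest fusion to show, and the paper may use a different scheme.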
Taxonomy
Topics: Head and Neck Cancer Studies
