Human Knowledge Integrated Multi-modal Learning for Single Source Domain Generalization
Ayan Banerjee, Kuntal Thakur, Sandeep Gupta

TL;DR
This paper introduces a theoretical framework and a multimodal vision-language model approach to improve single-source domain generalization in medical image classification when domains differ in unknown causal factors.
Contribution
It proposes domain conformal bounds (DCB) for assessing whether domains diverge, and GenEval, a model that combines foundational models with human knowledge to enhance generalization.
Findings
GenEval achieves 69.2% accuracy on DR datasets, outperforming baselines by 9.4%.
GenEval achieves 81% accuracy on SOZ datasets, outperforming baselines by 1.8%.
The theoretical framework (DCB) helps evaluate domain divergence without access to metadata.
Abstract
Generalizing image classification across domains remains challenging in critical tasks such as fundus image-based diabetic retinopathy (DR) grading and resting-state fMRI seizure onset zone (SOZ) detection. When domains differ in unknown causal factors, achieving cross-domain generalization is difficult, and there is no established methodology to objectively assess such differences without direct metadata or protocol-level information from data collectors, which is typically inaccessible. We first introduce domain conformal bounds (DCB), a theoretical framework to evaluate whether domains diverge in unknown causal factors. Building on this, we propose GenEval, a multimodal Vision Language Models (VLM) approach that combines foundational models (e.g., MedGemma-4B) with human knowledge via Low-Rank Adaptation (LoRA) to bridge causal gaps and enhance single-source domain generalization…
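The abstract names Low-Rank Adaptation (LoRA) as the mechanism for injecting human knowledge into the foundation model. As a rough illustration of the general LoRA idea only (not the authors' GenEval code, and without loading MedGemma-4B), the sketch below shows how a frozen weight matrix W is combined with a trainable low-rank update, W + (alpha / r) * B A. The helper names (`matmul`, `init_lora`, `lora_effective_weight`), the toy dimensions, and the initialization scale are all assumptions made for this example.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def init_lora(d_out, d_in, r, seed=0):
    """Hypothetical initializer: small random A, zero B (the usual LoRA convention)."""
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
    b = [[0.0] * r for _ in range(d_out)]
    return a, b

def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A.

    w: frozen pretrained weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)  # low-rank update, shape (d_out, d_in)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Toy dimensions, chosen only for illustration.
d_out, d_in, r = 4, 6, 2
w = [[float(i == j) for j in range(d_in)] for i in range(d_out)]
a, b = init_lora(d_out, d_in, r)
w_eff = lora_effective_weight(w, a, b, alpha=4.0)

trainable = r * (d_in + d_out)  # parameters in A and B
full = d_out * d_in             # parameters in W
```

Because B starts at zero, the adapted layer initially behaves exactly like the pretrained one, and only A and B are updated during fine-tuning. The parameter saving is negligible at these toy sizes but grows quickly when the layer dimensions are much larger than the rank r.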
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · EEG and Brain-Computer Interfaces
