CATCH: A Modular Cross-domain Adaptive Template with Hook
Xinjin Li, Yulie Lu, Jinghan Cao, Yu Ma, Zhenglin Li, Yeyang Zhou

TL;DR
CATCH is a flexible, plug-and-play framework that enhances cross-domain VQA model generalization by dynamically injecting lightweight modules for visual and linguistic adaptation without retraining the core model.
Contribution
It introduces a modular hook-based approach with lightweight adapters for domain classification and adaptation, enabling scalable cross-domain VQA without retraining backbone models.
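To make the hook idea concrete, here is a minimal sketch in plain Python of how a domain classifier might select a lightweight adapter and attach it to a frozen backbone through a hook. This is not the authors' implementation. Every name in it (FrozenBackbone, classify_domain, make_adapter, ADAPTERS, run_with_adaptation), the threshold rule, and the shift values are hypothetical stand-ins chosen for illustration. A real system would presumably use learned modules on tensor features.

```python
from typing import Callable, Dict, List

Vector = List[float]
Hook = Callable[[Vector], Vector]


class FrozenBackbone:
    """Hypothetical stand-in for a pretrained VQA model; its weights never change."""

    def __init__(self) -> None:
        self._hooks: List[Hook] = []

    def register_hook(self, hook: Hook) -> Callable[[], None]:
        """Attach a hook and return a function that detaches it again."""
        self._hooks.append(hook)
        return lambda: self._hooks.remove(hook)

    def encode(self, features: Vector) -> Vector:
        # Fixed "pretrained" transform (assumption: a simple scaling).
        hidden = [2.0 * x for x in features]
        # Hooks post-process the output without touching the backbone itself.
        for hook in self._hooks:
            hidden = hook(hidden)
        return hidden


def classify_domain(features: Vector) -> str:
    # Toy rule standing in for a learned lightweight domain classifier.
    return "medical" if sum(features) / len(features) > 0.5 else "natural"


def make_adapter(shift: float) -> Hook:
    # Residual-style adapter: backbone output plus a small domain-specific correction.
    return lambda hidden: [h + shift for h in hidden]


# One adapter per domain; adding a domain means adding an entry, not retraining.
ADAPTERS: Dict[str, Hook] = {
    "natural": make_adapter(0.0),
    "medical": make_adapter(0.25),
}


def run_with_adaptation(backbone: FrozenBackbone, features: Vector) -> Vector:
    """Pick the adapter for the detected domain, hook it in, run, then detach."""
    domain = classify_domain(features)
    remove = backbone.register_hook(ADAPTERS[domain])
    try:
        return backbone.encode(features)
    finally:
        remove()
```

The adapter is attached only for the duration of one call and detached afterwards, so the backbone behaves exactly as before whenever no hook is registered. That is the property a plug-and-play design depends on.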
Findings
Achieves consistent performance improvements across multiple domain-specific VQA benchmarks.
Requires no retraining of the backbone model, reducing cost and complexity.
Reports gains of +2.3 BLEU on MathVQA, +2.6 VQA on MedVQA-RAD, and +3.1 ROUGE on ChartQA.
Abstract
Recent advances in Visual Question Answering (VQA) have demonstrated impressive performance in natural image domains, with models like LLaVA leveraging large language models (LLMs) for open-ended reasoning. However, their generalization degrades significantly when transferred to out-of-domain scenarios such as remote sensing, medical imaging, or math diagrams, due to large distributional shifts and the lack of effective domain adaptation mechanisms. Existing approaches typically rely on per-domain fine-tuning or bespoke pipelines, which are costly, inflexible, and not scalable across diverse tasks. In this paper, we propose CATCH, a plug-and-play framework for cross-domain adaptation that improves the generalization of VQA models while requiring minimal changes to their core architecture. Our key idea is to decouple visual and linguistic adaptation by introducing two lightweight…
