Decoupling Vision and Language: Codebook Anchored Visual Adaptation
Jason Wu, Tianchen Zhao, Chang Liu, Jiarui Cai, Zheng Zhang, Zhuowei Li, Aaditya Singh, Xiang Xu, Mani Srivastava, Jonathan Wu

TL;DR
This paper introduces CRAFT, a lightweight, decoupled fine-tuning method for vision encoders in LVLMs that improves domain-specific visual task performance by anchoring representations to a stable codebook, without altering the language model.
Contribution
CRAFT is the first approach to decouple vision encoder fine-tuning from the language model using a discrete codebook, enabling flexible domain adaptation across different LVLM architectures.
Findings
Achieves an average performance gain of 13.51% across 10 benchmarks.
Maintains the linguistic capabilities of the original language models.
Outperforms existing continuous token-based adaptation methods.
Abstract
Large Vision-Language Models (LVLMs) use their vision encoders to translate images into representations for downstream reasoning. These encoders often underperform on domain-specific visual tasks such as medical image diagnosis or fine-grained classification, where representation errors can cascade through the language model and lead to incorrect responses. Existing adaptation methods modify the continuous feature interface between encoder and language model through projector tuning or other parameter-efficient updates, which still couples the two components and requires re-alignment whenever the encoder changes. We introduce CRAFT (Codebook RegulAted Fine-Tuning), a lightweight method that fine-tunes the encoder using a discrete codebook that anchors visual representations to a stable token space, achieving domain adaptation without modifying other parts of the model. This decoupled…
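To make the idea of anchoring representations to a discrete codebook more concrete, here is a minimal sketch of a generic nearest-neighbor codebook lookup (plain vector quantization). It is not the CRAFT training procedure, which the abstract does not spell out. The function names, the squared Euclidean distance, and the toy 2-D vectors are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    features: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[Vector]]:
    """Map each continuous feature to its nearest codebook entry.

    Returns the discrete token indices and the corresponding code vectors.
    Hypothetical helper: a generic lookup, not the paper's implementation.
    """
    indices = [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]
    return indices, [codebook[k] for k in indices]


# Toy data (assumed): a fixed 3-entry codebook and three encoder outputs.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
features = [(0.9, 0.1), (0.1, 0.8), (0.2, 0.1)]

indices, quantized = quantize(features, codebook)
print(indices)    # discrete token ids passed downstream
print(quantized)  # code vectors the continuous features were snapped to
```

In a setup like this, the downstream model only ever sees entries of the fixed codebook, so the encoder can change while the token space stays the same. That is the intuition behind the decoupling claim, under the assumptions stated above.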
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
