Decoupling Vision and Language: Codebook Anchored Visual Adaptation

Jason Wu; Tianchen Zhao; Chang Liu; Jiarui Cai; Zheng Zhang; Zhuowei Li; Aaditya Singh; Xiang Xu; Mani Srivastava; Jonathan Wu

arXiv:2602.19449 · cs.CV · February 24, 2026

TL;DR

This paper introduces CRAFT, a lightweight, decoupled fine-tuning method for vision encoders in LVLMs that improves domain-specific visual task performance by anchoring representations to a stable codebook, without altering the language model.

Contribution

CRAFT is the first approach to decouple vision encoder fine-tuning from the language model using a discrete codebook, enabling flexible domain adaptation across different LVLM architectures.

Findings

1. Achieves an average performance gain of 13.51% across 10 benchmarks.

2. Maintains the linguistic capabilities of the original language models.

3. Outperforms existing continuous token-based adaptation methods.

Abstract

Large Vision-Language Models (LVLMs) use their vision encoders to translate images into representations for downstream reasoning, but the encoders often underperform in domain-specific visual tasks such as medical image diagnosis or fine-grained classification, where representation errors can cascade through the language model, leading to incorrect responses. Existing adaptation methods modify the continuous feature interface between encoder and language model through projector tuning or other parameter-efficient updates, which still couples the two components and requires re-alignment whenever the encoder changes. We introduce CRAFT (Codebook RegulAted Fine-Tuning), a lightweight method that fine-tunes the encoder using a discrete codebook that anchors visual representations to a stable token space, achieving domain adaptation without modifying other parts of the model. This decoupled…
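The anchoring idea described in the abstract can be illustrated with a generic vector-quantization sketch. This is not the paper's implementation: the excerpt does not state CRAFT's training objective, so the function names `quantize` and `anchor_loss`, the squared-distance metric, and the toy two-dimensional codebook are all assumptions made for illustration. The sketch only shows why a fixed codebook gives a stable interface: the language model would see discrete token ids, and those ids keep their meaning as long as the codebook itself is not updated.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each continuous feature vector to the index of its nearest code.

    The codebook is treated as frozen (an assumption of this sketch), so the
    returned ids form a token space that does not move when the encoder does.
    """
    ids = []
    for feature in features:
        distances = [squared_distance(feature, code) for code in codebook]
        ids.append(distances.index(min(distances)))
    return ids


def anchor_loss(features: Sequence[Vector], codebook: Sequence[Vector]) -> float:
    """Mean squared distance between each feature and its assigned code.

    A hypothetical regularizer: minimizing it with respect to the encoder
    only would pull encoder outputs toward the fixed codebook entries.
    """
    ids = quantize(features, codebook)
    total = sum(squared_distance(f, codebook[i]) for f, i in zip(features, ids))
    return total / len(features)


# Toy data: three codes and three encoder outputs, all hypothetical.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
features = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.7]]

token_ids = quantize(features, codebook)  # -> [0, 1, 2]
loss = anchor_loss(features, codebook)
print(token_ids, round(loss, 4))
```

In this toy setup, only the encoder producing `features` would be trained; the codebook and everything downstream of the token ids stay untouched, which is the sense in which the abstract calls the adaptation decoupled.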

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis