CARPE: Context-Aware Image Representation Prioritization via Ensemble for Large Vision-Language Models
Donghee Lee, Rui Cai, Zhe Zhao

TL;DR
CARPE is a lightweight framework that improves large vision-language models by adaptively balancing visual and textual representations, enhancing performance on image classification and vision-language tasks.
Contribution
CARPE introduces an ensemble-based method that dynamically prioritizes image features within LVLMs, addressing the loss of vision-centric capability that results from aligning visual representations with the language space.
Findings
CARPE improves image classification accuracy, a task on which LVLMs can otherwise underperform their base vision encoders.
It also improves performance on diverse vision-language benchmarks.
Modality balancing is crucial for multimodal generalization.
Abstract
Large vision-language models (LVLMs) are typically trained using autoregressive language modeling objectives, which align visual representations with linguistic space. While effective for multimodal reasoning, this alignment can weaken vision-centric capabilities, causing LVLMs to underperform their base vision encoders on tasks such as image classification. To address this limitation, we propose Context-Aware Image Representation Prioritization via Ensemble (CARPE), a lightweight framework that integrates raw vision features with aligned LLM representations through vision-integration layers and a context-aware ensemble mechanism. This design enhances the model's ability to adaptively weight visual and textual modalities and enables the model to capture various aspects of image representations. Extensive experiments demonstrate that CARPE improves performance on both image…
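To make the idea of a context-aware ensemble concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes two classification heads (one fed by raw vision-encoder features, one by LLM-aligned representations) and a scalar sigmoid gate computed from a context vector; the names `gate_weight`, `ensemble_probs`, `gate_params`, and `context` are hypothetical, and the abstract does not specify the actual gating function or the vision-integration layers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gate_weight(context, gate_params, bias=0.0):
    """Hypothetical gate: sigmoid of a linear score of the context vector.

    Returns a weight in (0, 1) given to the vision branch.
    """
    score = sum(c * g for c, g in zip(context, gate_params)) + bias
    return 1.0 / (1.0 + math.exp(-score))

def ensemble_probs(vision_logits, llm_logits, context, gate_params, bias=0.0):
    """Convex combination of the two branches' class probabilities.

    Assumption: both branches score the same set of classes.
    """
    w = gate_weight(context, gate_params, bias)
    p_vis = softmax(vision_logits)
    p_llm = softmax(llm_logits)
    return [w * pv + (1.0 - w) * pl for pv, pl in zip(p_vis, p_llm)]

if __name__ == "__main__":
    # Toy numbers chosen only for illustration.
    probs = ensemble_probs(
        vision_logits=[2.0, 0.5, -1.0],
        llm_logits=[0.1, 0.3, 0.2],
        context=[1.0, -0.5],
        gate_params=[0.8, 0.4],
    )
    print(probs)
```

Because the output is a convex combination of two probability vectors, it remains a valid distribution for any gate value; a gate near 1 defers to the vision branch and a gate near 0 defers to the language-aligned branch.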
