POINTS: Improving Your Vision-language Model with Affordable Strategies
Yuan Liu, Zhongyin Zhao, Ziyuan Zhuang, Le Tian, Xiao Zhou, Jie Zhou

TL;DR
This paper introduces affordable, effective strategies for improving vision-language models, including data filtering, model ablations, and model soup techniques, resulting in a competitive 9B-parameter model.
Contribution
The paper presents a robust baseline, data filtering with perplexity, and model soup methods to enhance vision-language models efficiently.
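As a rough illustration of perplexity-based data filtering, the sketch below scores each sample by the exponential of its negative mean token log-probability and keeps the lowest-scoring fraction. The function names, the `keep_fraction` parameter, and the choice to keep low-perplexity samples are assumptions made for this example; the summary does not state the threshold or scoring model the authors used.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of one sample: exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def filter_by_perplexity(samples, keep_fraction):
    """Return the ids of the lowest-perplexity samples.

    samples: list of (sample_id, token_logprobs) pairs, where token_logprobs
             are natural-log probabilities assigned by some scoring model.
    keep_fraction: hypothetical knob in (0, 1]; not a value from the paper.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Sort ascending so the samples the scoring model finds most predictable come first.
    scored = sorted(samples, key=lambda s: perplexity(s[1]))
    n_keep = max(1, int(len(scored) * keep_fraction))
    return [sample_id for sample_id, _ in scored[:n_keep]]


# Toy data: ids with made-up token log-probabilities.
toy_samples = [
    ("a", [-0.1, -0.2]),
    ("b", [-2.0, -3.0]),
    ("c", [-0.5, -0.5]),
    ("d", [-1.0, -1.0]),
]
kept = filter_by_perplexity(toy_samples, keep_fraction=0.5)
```

In practice the log-probabilities would come from a forward pass of a language model or vision-language model over each sample; that step is omitted here to keep the sketch self-contained.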
Findings
Achieved competitive performance with a 9B-parameter model.
Curated a 1M dataset using perplexity filtering for pre-training.
Used model soup to improve fine-tuning results.
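A model soup, in its simplest (uniform) form, averages the weights of several fine-tuned checkpoints that share one architecture. The sketch below shows that averaging on plain Python dictionaries of float lists standing in for checkpoint state; `uniform_soup` and the toy checkpoints are hypothetical names for illustration, and the summary does not say which soup variant or checkpoint selection the authors used.

```python
def uniform_soup(checkpoints):
    """Average parameters element-wise across checkpoints.

    checkpoints: list of dicts mapping parameter name -> list of floats.
    All checkpoints must have the same names and the same lengths per name.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    for ckpt in checkpoints[1:]:
        if ckpt.keys() != names:
            raise ValueError("checkpoints must share the same parameter names")
    n = len(checkpoints)
    soup = {}
    for name in names:
        tensors = [ckpt[name] for ckpt in checkpoints]
        if len({len(t) for t in tensors}) != 1:
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise mean over the n checkpoints.
        soup[name] = [sum(values) / n for values in zip(*tensors)]
    return soup


# Two toy "checkpoints", e.g. the same model fine-tuned on different data mixes.
ckpt_a = {"w": [1.0, 2.0], "b": [0.0]}
ckpt_b = {"w": [3.0, 4.0], "b": [1.0]}
souped = uniform_soup([ckpt_a, ckpt_b])
```

The appeal of this design is that the averaged model costs the same to run as any single checkpoint, unlike an ensemble that runs every member at inference time.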
Abstract
In recent years, vision-language models have made significant strides, excelling in tasks like optical character recognition and geometric problem-solving. However, several critical issues remain: 1) Proprietary models often lack transparency about their architectures, while open-source models need more detailed ablations of their training strategies. 2) Pre-training data in open-source works is under-explored, with datasets added empirically, making the process cumbersome. 3) Fine-tuning often focuses on adding datasets, leading to diminishing returns. To address these issues, we propose the following contributions: 1) We trained a robust baseline model using the latest advancements in vision-language models, introducing effective improvements and conducting comprehensive ablation and validation for each technique. 2) Inspired by recent work on large language models, we filtered…
Peer Reviews
Decision · Submitted to ICLR 2025
- Paper is clearly written.
- Introduced approaches are technically sound.
- Novelty concern: The proposed strong baseline model primarily integrates multiple existing advancements in vision-language models, with limited novel technical contributions or findings. Specifically, the proposed CATTY encoding combines dynamic high-resolution encoding and sliding window techniques, both of which have been explored in previous works [1-3]. Additionally, model soup or model ensemble techniques have been shown to effectively boost performance. The novelty of the introduced maxi
- The use of perplexity-based filtering of pre-training data is inspired by its success in large language models. This approach is a creative application of an existing concept in a new context, addressing data quality for vision-language models.
- The paper is well-structured, with clear divisions between each method's presentation and the experimental results validating those methods.
- The paper's focus on affordable strategies is practical for the broader community, especially for those with
- Each of the proposed methods, such as perplexity filtering, CATTY, and model soup, is either an existing method or a slight modification of an existing approach. This limits the overall novelty of the contributions. More significant deviations or novel techniques could make the paper's contributions more impactful.
- While the authors claim that model soup introduces improvements, the paper does not thoroughly explore its potential limitations or generalizability across different tasks or datase
- The proposed components are indeed effective and target the limitations of existing vision-language models in great detail.
- Compared to fine-tuning dataset selection, the proposed model soup is a more sustainable design choice in terms of the return on performance with scale.
- The perplexity-based dataset selection technique is simple and could have a broader impact on data-efficient pre-training of large foundation models.
- While the paper is overall well-written, it is hard to spot the motivation behind the introduced components now and then. For instance, the introduction states "... However, recent advancements have rendered its (LLaVa's) performance suboptimal. Thus, there is a need to establish a stronger baseline for further exploration..." Without proper references or an explicit description of how LLaVa's performance has gone suboptimal, the current motivation reads as "We know the model x and we want t
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Topic Modeling
