Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining
Han Huang, Yuqi Huo, Zijia Zhao, Haoyu Lu, Shu Wu, Bingning Wang, Qiang Liu, Weipeng Chen, Liang Wang

TL;DR
This paper introduces AITQE, a model that dynamically assesses and improves image-text pair quality for multimodal pretraining, raising data utilization and scalability while making only minimal changes to the text.
Contribution
The paper proposes AITQE, a novel adaptive method that assesses and enhances image-text pair quality through text rewriting and negative sampling, improving over filter-based approaches.
Findings
AITQE outperforms existing methods on benchmark datasets.
It effectively leverages raw data and scales with data volume.
Minimal text adjustments preserve data diversity.
Abstract
Multimodal large language models (MLLMs) have made significant strides by integrating visual and textual modalities. A critical factor in training MLLMs is the quality of image-text pairs within multimodal pretraining datasets. However, filter-based data quality enhancement paradigms often discard a substantial portion of high-quality image data due to inadequate semantic alignment between images and texts, leading to inefficiencies in data utilization and scalability. In this paper, we propose the Adaptive Image-Text Quality Enhancer (AITQE), a model that dynamically assesses and enhances the quality of image-text pairs. AITQE employs a text rewriting mechanism for low-quality pairs and incorporates a negative sample learning strategy to improve evaluative capabilities by integrating deliberately selected low-quality samples during training. Unlike prior approaches…
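The contrast the abstract draws between filtering and enhancement can be sketched in a few lines of Python. This is a toy illustration only, not the authors' implementation: `Pair`, `IMAGE_TAGS`, `toy_score`, `toy_rewrite`, and the 0.5 threshold are all hypothetical stand-ins for what AITQE learns as a model, and the negative sample learning strategy is not covered here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Pair:
    image_id: str
    caption: str


# Hypothetical stand-in for image content; a real system would look at pixels.
IMAGE_TAGS = {
    "img1": {"dog", "beach"},
    "img2": {"cat", "sofa"},
}


def toy_score(pair: Pair) -> float:
    """Toy alignment score: fraction of the image's tags named in the caption."""
    tags = IMAGE_TAGS[pair.image_id]
    words = set(pair.caption.lower().split())
    return len(tags & words) / len(tags)


def toy_rewrite(pair: Pair) -> str:
    """Toy rewriter: builds a new caption from the image's tags."""
    return "a photo of " + " and ".join(sorted(IMAGE_TAGS[pair.image_id]))


def filter_pipeline(
    pairs: List[Pair], score: Callable[[Pair], float], threshold: float
) -> List[Pair]:
    """Filter-based paradigm: low-scoring pairs are dropped, image included."""
    return [p for p in pairs if score(p) >= threshold]


def enhance_pipeline(
    pairs: List[Pair],
    score: Callable[[Pair], float],
    rewrite: Callable[[Pair], str],
    threshold: float,
) -> List[Pair]:
    """Enhancement paradigm: good captions stay untouched, poor ones are rewritten."""
    result = []
    for p in pairs:
        if score(p) >= threshold:
            result.append(p)  # minimal adjustment: keep the original text
        else:
            result.append(Pair(p.image_id, rewrite(p)))  # keep the image, fix the text
    return result


pairs = [
    Pair("img1", "a dog running on the beach"),
    Pair("img2", "best deals click here"),
]

kept = filter_pipeline(pairs, toy_score, threshold=0.5)
enhanced = enhance_pipeline(pairs, toy_score, toy_rewrite, threshold=0.5)

print([p.image_id for p in kept])
print([(p.image_id, p.caption) for p in enhanced])
```

In this toy run the filter discards the second image entirely because its caption is unrelated, while the enhancement pipeline keeps both images and rewrites only the poorly aligned caption.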
Taxonomy
Topics: Natural Language Processing Techniques · Mathematics, Computing, and Information Processing
