VITAL: Vision-Encoder-centered Pre-training for LMMs in Visual Quality Assessment
Ziheng Jia, Linhan Cao, Jinliang Han, Zicheng Zhang, Jiaying Qian, Jiarui Wang, Zijian Chen, Guangtao Zhai, Xiongkuo Min

TL;DR
This paper introduces VITAL, a vision-encoder-centered pre-training approach for large multi-modal models in visual quality assessment, emphasizing versatility, transferability, and efficient training on a large dataset.
Contribution
The paper presents a novel vision-encoder-focused pre-training pipeline, a large-scale dataset, and a multi-task training method that enhances generalization and efficiency for VQualA LMMs.
Findings
Constructed over 4.5M vision-language pairs, reported as the largest VQualA training dataset to date.
Achieved strong zero-shot performance with minimal fine-tuning data.
Enhanced model versatility and transferability across image and video modalities.
Abstract
Developing a robust visual quality assessment (VQualA) large multi-modal model (LMM) requires achieving versatility, powerfulness, and transferability. However, existing VQualA LMMs typically focus on a single task and rely on full-parameter fine-tuning, which makes them prone to overfitting on specific modalities or task types, thereby limiting their generalization capacity and transferability. To address this, we propose a vision-encoder-centered generative pre-training pipeline and develop the VITAL-Series LMMs. (1) We adopt a machine-executed annotation-scrutiny paradigm, constructing over 4.5M vision-language (VL) pairs, the largest VQualA training dataset to date. (2) We employ a multi-task training workflow that simultaneously enhances the model's quantitative scoring precision and strengthens its capability for quality interpretation across both image and video modalities. (3)…
Taxonomy
Topics: Image and Video Quality Assessment · Visual Attention and Saliency Detection · Image Enhancement Techniques
