Better Supervised Fine-tuning for VQA: Integer-Only Loss
Baihong Qian, Haotian Fan, Wenjie Liao, Yunqiu Wang, Tao Li, and Junhui Cui

TL;DR
This paper introduces IOVQA, a novel fine-tuning method for vision language models that uses integer-only labels and a targeted loss to improve video quality assessment accuracy and consistency.
Contribution
The paper presents a new integer-only label construction and loss calculation strategy for fine-tuning VLMs, enhancing their performance in quantitative evaluation tasks.
Findings
Achieved 3rd place in the VQualA 2025 GenAI-Bench challenge.
Significantly improved VQA accuracy and consistency.
Demonstrated effectiveness of integer-only labels in fine-tuning.
Abstract
With the rapid advancement of vision language models (VLMs), their ability to assess visual content based on specific criteria and dimensions has become increasingly critical for applications such as video-theme consistency assessment and visual quality scoring. However, existing methods often suffer from imprecise results and inefficient loss calculation, which limit the model's focus on key evaluation indicators. To address this, we propose IOVQA (Integer-only VQA), a novel fine-tuning approach tailored for VLMs to enhance their performance in video quality assessment tasks. The key innovation of IOVQA lies in its label construction and its targeted loss calculation mechanism. Specifically, during dataset curation, we constrain the model's output to integers within the range of [10, 50], ensuring numerical stability, and convert decimal Overall_MOS values to integers before using them as…
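The label construction and targeted loss described in the abstract can be illustrated with a minimal sketch. The abstract is truncated and does not state the exact conversion rule or loss formula, so everything below is an assumption for illustration: the function names `mos_to_int_label` and `score_only_loss` are hypothetical, the decimal MOS is assumed to lie in 1.0 to 5.0 and to be scaled by 10 with round-half-up, and the loss is assumed to be a mean negative log-likelihood restricted to the tokens that spell the integer score.

```python
import math


def mos_to_int_label(mos: float, scale: int = 10, lo: int = 10, hi: int = 50) -> int:
    """Map a decimal MOS to an integer label clamped to [lo, hi].

    Assumption: MOS lies roughly in 1.0-5.0, so scaling by 10 lands in [10, 50].
    The scaling factor and rounding rule are illustrative, not from the paper.
    """
    value = int(math.floor(mos * scale + 0.5))  # round half up
    return max(lo, min(hi, value))


def score_only_loss(token_log_probs, score_mask):
    """Mean negative log-likelihood over score tokens only.

    token_log_probs: log-probability the model assigned to each target token.
    score_mask: 1 where the target token is part of the integer score, else 0.
    Assumption: this masking is one plausible reading of a "targeted" loss.
    """
    picked = [-lp for lp, keep in zip(token_log_probs, score_mask) if keep]
    if not picked:
        raise ValueError("score_mask selects no tokens")
    return sum(picked) / len(picked)


if __name__ == "__main__":
    print(mos_to_int_label(3.74))                         # integer label for one MOS
    print(score_only_loss([-0.1, -2.0, -0.5], [0, 1, 1]))  # loss over the two score tokens
```

In a real fine-tuning setup the same masking is usually expressed by setting non-score positions of the label tensor to the framework's ignore index, so that the cross-entropy loss skips them.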
