Continual SFT Matches Multimodal RLHF with Negative Supervision

Ke Zhu, Yu Wang, Yanpeng Sun, Qiang Chen, Jiangjiang Liu, Gang Zhang, Jingdong Wang

arXiv:2411.14797 · cs.LG · November 25, 2024

TL;DR

This paper introduces negative supervised finetuning (nSFT), which isolates the negative supervision found in multimodal RLHF and uses it to align vision-language models with a plain SFT loss, at a lower memory cost than RLHF.

Contribution

The paper proposes nSFT, a novel approach that exploits negative supervision in RLHF for better alignment of vision-language models with less memory usage.

Findings

1. nSFT is compared against multimodal RLHF methods across dataset sources and base VLMs, and is reported to match or outperform them.

2. nSFT is more memory-efficient than multimodal RLHF, which requires 2 (e.g., DPO) or 4 (e.g., PPO) large VLMs.

3. Ablation studies support the effectiveness of negative supervision in model alignment.

Abstract

Multimodal RLHF usually follows the supervised finetuning (SFT) stage to further improve the comprehension of vision-language models (VLMs). Conventional wisdom holds that it is superior to continual SFT during this preference alignment stage. In this paper, we observe that the inherent value of multimodal RLHF lies in its negative supervision, the logits of the rejected responses. We thus propose a novel negative supervised finetuning (nSFT) approach that fully exploits the information these responses contain. Our nSFT disentangles this negative supervision from the RLHF paradigm and continually aligns VLMs with a simple SFT loss. This is more memory efficient than multimodal RLHF, where 2 (e.g., DPO) or 4 (e.g., PPO) large VLMs are strictly required. The effectiveness of nSFT is rigorously proved by comparing it with various multimodal RLHF approaches, across different dataset sources, base VLMs and…
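The memory contrast described in the abstract can be illustrated with a minimal sketch. It is not the paper's nSFT implementation, and how nSFT builds its training targets from rejected responses is not described on this page. The sketch only shows why a standard DPO objective needs log-probabilities from two models (the policy and a frozen reference), while a plain SFT objective needs only the model being trained. The function names `sft_loss` and `dpo_loss`, the `beta` default, and the toy numbers are illustrative assumptions.

```python
import math


def sft_loss(token_logprobs):
    """Mean negative log-likelihood of the target tokens.

    Only the model being trained is needed to produce token_logprobs.
    """
    return -sum(token_logprobs) / len(token_logprobs)


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss on sequence-level log-probabilities.

    The ref_* terms come from a second, frozen reference model, which is
    why DPO keeps two large models in memory. The rejected response enters
    through policy_rejected: this is the negative supervision signal.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin))
    return math.log1p(math.exp(-beta * margin))


# Toy values (assumed, not from the paper).
print(sft_loss([-1.0, -3.0]))
print(dpo_loss(policy_chosen=-4.0, policy_rejected=-8.0,
               ref_chosen=-5.0, ref_rejected=-5.0))
```

In the DPO term, the loss falls as the policy lowers the likelihood of the rejected response relative to the reference. The abstract's claim is that this signal can instead be delivered through an SFT-style loss, so no reference model has to be held in memory.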

Taxonomy

Topics: Power Systems and Technologies

Methods: Shrink and Fine-Tune · ALIGN · Balanced Selection