Align$^2$LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation

Hongzhe Huang; Jiang Liu; Zhewen Yu; Li Cai; Dian Jiao; Wenqiao Zhang; Siliang Tang; Juncheng Li; Hao Jiang; Haoyuan Li; Yueting Zhuang

arXiv:2409.18541 · cs.AI · December 5, 2025

TL;DR

This paper presents an instruction curation method for multi-modal large language models (MLLMs) that combines human and LLM preference alignment, compressing the training data by up to 90% (from 158k to 14k instructions) while maintaining or improving model performance.

Contribution

It introduces a dual-perspective alignment approach that uses human expert criteria and LLM style alignment to build compact, high-quality instruction datasets for MLLMs.
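The cascade described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function `cascade_filter`, the "keep the top fraction" rule, and the two scoring callables are all hypothetical stand-ins. Here `human_score` plays the role of a reward model trained on human annotations and `llm_score` plays the role of the LLM-side alignment stage; how the paper actually scores and selects samples is not stated on this page.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def top_fraction(items: Sequence[T], score: Callable[[T], float], frac: float) -> List[T]:
    """Return the highest-scoring `frac` of `items` (at least one item if any exist)."""
    k = max(1, int(len(items) * frac))
    return sorted(items, key=score, reverse=True)[:k]


def cascade_filter(
    samples: Sequence[T],
    human_score: Callable[[T], float],  # hypothetical: human-preference reward model
    llm_score: Callable[[T], float],    # hypothetical: LLM-preference alignment score
    keep_human: float,
    keep_llm: float,
) -> List[T]:
    """Two-stage curation: filter by human preference first, then by LLM preference."""
    stage1 = top_fraction(samples, human_score, keep_human)
    return top_fraction(stage1, llm_score, keep_llm)


# Toy data: each sample carries precomputed scores (assumed, for illustration only).
samples = [{"id": i, "h": float(i), "l": float(-i)} for i in range(10)]
kept = cascade_filter(
    samples,
    human_score=lambda s: s["h"],
    llm_score=lambda s: s["l"],
    keep_human=0.5,
    keep_llm=0.4,
)
kept_ids = [s["id"] for s in kept]
print(kept_ids)
```

Running the stages in sequence means the second scorer only sees samples that already passed the first, which is one plausible reading of "cascaded"; other orderings or a joint score would be equally easy to express.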

Findings

01. Compressed training instructions by up to 90%.

02. Reduced dataset size from 158k to 14k instructions.

03. Achieved better performance than models trained on full datasets.
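As a quick consistency check on the first two findings, the short sketch below computes the reduction implied by the rounded figures quoted on this page (158k and 14k). The helper name `compression_ratio` is illustrative only.

```python
def compression_ratio(original: int, curated: int) -> float:
    """Fraction of the original dataset that was removed."""
    return 1.0 - curated / original


# Rounded sizes as quoted above; the exact counts are not given on this page.
ratio = compression_ratio(158_000, 14_000)
shrink_factor = 158_000 / 14_000
print(f"removed {ratio:.1%}, about {shrink_factor:.1f}x fewer instructions")
```

With these rounded inputs the reduction comes out at roughly 91%, in line with the "up to 90%" figure.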

Abstract

Recent advances in Multi-modal Large Language Models (MLLMs), such as LLaVA-series models, are driven by massive machine-generated instruction-following data tuning. Such automatic instruction collection pipelines, however, inadvertently introduce significant variability in data quality. This paper introduces a novel instruction curation algorithm, derived from two unique perspectives, human and LLM preference alignment, to compress this vast corpus of machine-generated multimodal instructions to a compact and high-quality form: (i) For human preference alignment, we have collected a machine-generated multimodal instruction dataset and established a comprehensive set of both subjective and objective criteria to guide the data quality assessment critically from human experts. By doing so, a reward model was trained on the annotated dataset to internalize the nuanced human understanding…

Code & Models

Repositories

dcdmllm/align2llava (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Sparse Evolutionary Training · ALIGN