Synth-Align: Improving Trustworthiness in Vision-Language Model with Synthetic Preference Data Alignment
Robert Wijaya, Ngoc-Bao Nguyen, Ngai-Man Cheung

TL;DR
Synth-Align introduces a synthetic data generation pipeline for aligning vision-language models, significantly reducing hallucinations and improving safety, robustness, and instruction-following in multimodal AI systems.
Contribution
The paper presents a novel synthetic preference data generation method tailored for post-training alignment of vision-language models using reward models and DPO.
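As a rough illustration of how reward-model scores and DPO can fit together, the sketch below selects a chosen/rejected pair from scored candidate responses and computes the standard DPO loss for that pair. It is not the authors' implementation: the helper names (`pick_preference_pair`, `dpo_loss`), the highest-versus-lowest selection rule, and the default `beta=0.1` are assumptions made for the example, and the log-probabilities are placeholder numbers rather than model outputs.

```python
import math


def pick_preference_pair(candidates, scores):
    """Return (chosen, rejected) as the highest- and lowest-scored candidates.

    Assumption: the reward model has already produced one scalar score per
    candidate response; taking the extremes is only one possible rule.
    """
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0])
    return ranked[-1][1], ranked[0][1]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Each argument is the summed log-probability of a full response under
    the policy being trained or under the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) written in a numerically stable form
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Hypothetical candidates and reward scores, for illustration only.
chosen, rejected = pick_preference_pair(
    ["caption A", "caption B", "caption C"], [0.2, 0.9, 0.1]
)
loss = dpo_loss(-12.0, -15.0, -13.0, -14.0)
```

When the policy and the reference model assign identical log-probabilities, the loss equals log 2; it falls below that as the policy favors the chosen response more strongly than the reference does.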
Findings
The aligned LLaVA-1.5-7B achieved 87.6% accuracy on POPE
Hallucination rate reduced from 51.0% to 25.0%
Improved MMHal-Bench score from 2.36 to 3.49
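To put the reported before/after figures on a common scale, the short calculation below converts them to relative changes. It uses only the numbers listed above; the helper name `relative_change` is introduced here for illustration.

```python
def relative_change(before, after):
    """Fractional change from `before` to `after` (negative means a decrease)."""
    return (after - before) / before


# Reported figures: hallucination rate 51.0% -> 25.0%, MMHal-Bench score 2.36 -> 3.49
hallucination_change = relative_change(51.0, 25.0)  # roughly -0.51
score_change = relative_change(2.36, 3.49)          # roughly +0.48
```

In other words, the hallucination rate drops by 26 percentage points (about half of its original value), while the MMHal-Bench score rises by a little under 50% relative to the baseline.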
Abstract
Large Vision-Language Models (LVLMs) have shown promising capabilities in understanding and generating information by integrating both visual and textual data. However, current models are still prone to hallucinations, which degrade performance and greatly harm the user experience in real-world applications. Post-training alignment, particularly preference-tuning, is intended to align model outputs and behaviors (safety, instruction-following, style), ensuring robustness and adaptability to a wide range of tasks. The use of synthetic data for alignment, particularly in multimodal settings, remains underexplored. Existing approaches typically use a strong model or a ground-truth model (CLIP) to determine positive and negative image-text data points. This paper proposes SynthAlign, a pipeline to generate and collect synthetic human-preference image-text data with optimal control…
Taxonomy
Topics: Data Management and Algorithms
Methods: Linear Layer · Dense Connections · Residual Connection · Adam · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization · Dropout · Softmax
