Grounded Visual Factualization: Factual Anchor-Based Finetuning for Enhancing MLLM Factual Consistency

Filippo Morbiato; Luca Romano; Alessandro Persona

arXiv:2511.10671·cs.CL·November 17, 2025

Open Access

TL;DR

This paper presents Grounded Visual Factualization (GVF) Finetuning, a novel method that significantly improves the factual consistency of Multimodal Large Language Models by integrating explicit factual signals and penalizing inaccuracies.

Contribution

The paper introduces GVF Finetuning, combining factual anchor data augmentation, fact-aware instruction tuning, and a factual consistency loss to enhance MLLM factual accuracy.

Findings

01 GVF outperforms standard fine-tuning on the VHTest benchmark.

02 GVF maintains or improves performance on general multimodal benchmarks.

03 GVF reduces visual hallucinations without harming reasoning abilities.

Abstract

Visual hallucination, where Multimodal Large Language Models fabricate details inconsistent with image content, critically undermines their reliability. Existing fine-tuning methods offer limited improvement, failing to deeply intervene in factual reasoning. This paper introduces Grounded Visual Factualization (GVF) Finetuning, a novel approach to systematically enhance MLLM visual factual consistency. GVF integrates explicit factual signals via three core mechanisms: Factual Anchor Data Augmentation, enriching training data with structured factual anchors and counter-factual prompts; Fact-Aware Instruction Tuning, embedding these cues into explicit instructions; and a Factual Consistency Loss function, specifically penalizing factual inaccuracies. Evaluated on LLaVA-1.5-13B, GVF Finetuning significantly outperforms standard fine-tuning on the VHTest benchmark for both Open-Ended…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning