Token Preference Optimization with Self-Calibrated Visual-Anchored Rewards for Hallucination Mitigation

Jihao Gu; Yingyao Wang; Meng Cao; Pi Bu; Jun Song; Yancheng He; Shilong Li; Bo Zheng

arXiv:2412.14487·cs.CV·September 24, 2025

TL;DR

This paper introduces TPO, a novel token preference optimization method with self-calibrated visual-anchored rewards, significantly reducing hallucinations in large vision-language models by focusing on visual-correlated tokens without detailed annotations.

Contribution

The paper presents a new TPO model that adaptively emphasizes visual-anchored tokens using self-calibrated rewards, improving hallucination mitigation in LVLMs.

Findings

01 Achieves state-of-the-art hallucination mitigation performance.

02 Boosts performance on hallucination benchmarks when built on LLaVA-1.5-7B.

03 Effectively attends to visual-correlated tokens without fine-grained annotations.

Abstract

Direct Preference Optimization (DPO) has been demonstrated to be highly effective in mitigating hallucinations in Large Vision Language Models (LVLMs) by aligning their outputs more closely with human preferences. Despite the recent progress, existing methods suffer from two drawbacks: 1) lack of scalable token-level rewards; and 2) neglect of visual-anchored tokens. To this end, we propose a novel Token Preference Optimization model with self-calibrated rewards (dubbed TPO), which adaptively attends to visual-correlated tokens without fine-grained annotations. Specifically, we introduce a token-level visual-anchored reward as the difference of the logistic distributions of generated tokens conditioned on the raw image and the corrupted one. In addition, to highlight the informative visual-anchored tokens, a visual-aware training objective is proposed to enhance more…
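
The token-level reward described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function name visual_anchored_scores, the use of softmax probabilities (rather than raw logits or log-probabilities), and the toy inputs are all assumptions made for illustration. The idea shown is only that a token whose probability drops when the image is corrupted is likely anchored to visual content, while a token whose probability is unchanged is not.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def visual_anchored_scores(logits_raw, logits_corrupted, token_ids):
    """Hypothetical helper: per-token probability difference.

    logits_raw[t] and logits_corrupted[t] are the model's vocabulary
    logits at step t when conditioned on the raw image and on a
    corrupted copy of it. token_ids[t] is the token actually generated.
    Assumption: the difference is taken between softmax probabilities.
    """
    scores = []
    for raw, corrupted, tok in zip(logits_raw, logits_corrupted, token_ids):
        p_raw = softmax(raw)[tok]
        p_corrupted = softmax(corrupted)[tok]
        # Large positive value: the token depends on the image content.
        scores.append(p_raw - p_corrupted)
    return scores


# Toy example: vocabulary of 3 tokens, 2 generated tokens.
logits_raw = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
logits_corrupted = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
token_ids = [0, 1]

scores = visual_anchored_scores(logits_raw, logits_corrupted, token_ids)
```

In this toy input, the first token loses probability once the image is corrupted and receives a positive score, while the second token is unaffected and scores zero. How such scores are calibrated and folded into the preference objective is not specified in this excerpt.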

Taxonomy

Topics: Psychedelics and Drug Studies · Functional Brain Connectivity Studies · Hallucinations in medical conditions