Improving Alignment in LVLMs with Debiased Self-Judgment

Sihan Yang; Chenhang Cui; Zihao Zhao; Yiyang Zhou; Weilong Yan; Ying Wei; Huaxiu Yao

arXiv:2508.20655·cs.CV·September 12, 2025


TL;DR

This paper introduces a novel internal self-evaluation method for LVLMs that reduces hallucinations and improves alignment without external data, enhancing safety and performance.

Contribution

The paper proposes a debiased self-judgment score that enables LVLMs to self-evaluate and improve alignment autonomously, reducing reliance on external datasets.

Findings

1. Significantly reduces hallucinations in LVLM outputs.

2. Improves safety and alignment quality.

3. Outperforms traditional alignment methods.

Abstract

The rapid advancements in Large Language Models (LLMs) and Large Visual-Language Models (LVLMs) have opened up new opportunities for integrating visual and linguistic modalities. However, effectively aligning these modalities remains challenging, often leading to hallucinations--where generated outputs are not grounded in the visual input--and raising safety concerns across various domains. Existing alignment methods, such as instruction tuning and preference tuning, often rely on external datasets, human annotations, or complex post-processing, which limit scalability and increase costs. To address these challenges, we propose a novel approach that generates the debiased self-judgment score, a self-evaluation metric created internally by the model without relying on external resources. This enables the model to autonomously improve alignment. Our method enhances both decoding…

Tables

Table 1: CHAIR evaluation results on the MSCOCO dataset for LVLMs with different decoding baselines and methods designed to reduce object hallucinations. Lower CHAIR_S and CHAIR_I scores indicate fewer object hallucinations, while higher BLEU scores generally reflect better captioning quality.

	LLaVA-1.5			InstructBLIP			mPLUG-Owl2
Method	CHAIR_S $↓$	CHAIR_I $↓$	BLEU $↑$	CHAIR_S $↓$	CHAIR_I $↓$	BLEU $↑$	CHAIR_S $↓$	CHAIR_I $↓$	BLEU $↑$
Greedy	22.4	5.8	0.249	29.0	12.9	0.217	23.1	8.4	0.279
Beam Search	19.6	6.3	0.247	31.8	14.3	0.228	22.5	8.1	0.280
DoLA	21.0	6.7	0.256	30.0	9.1	0.238	22.0	7.8	0.283
OPERA	26.4	7.8	0.210	26.0	8.2	0.251	18.6	6.6	0.286
VCD	20.7	5.3	0.247	25.8	7.1	0.244	25.5	9.2	0.273
Woodpecker	17.5	4.0	0.259	28.0	11.0	0.249	20.0	7.3	0.286
LURE	18.0	4.5	0.253	31.0	11.9	0.251	16.4	6.4	0.283
HALC	15.9	3.5	0.255	27.2	10.3	0.253	21.1	7.4	0.298
DSGD	15.2	4.0	0.263	20.1	6.9	0.271	14.2	4.5	0.300

Table 2: Comparison of different methods on FaithScore and Sentence-level FaithScore with LLaVA-1.5-7B.

Method	F-Score ↑	F-Score_S ↑
Greedy	84.6	66.3
VCD	85.2	63.1
Opera	88.4	67.9
HALC	86.3	67.8
LURE	88.8	67.4
Woodpecker	86.2	66.5
DSGD	89.3	75.1

Table 3: Attack success rate (ASR) of different defense methods on various models on MM-SafetyBench. The last column represents the average of the 6 categories (IA, HS, MG, Fr, Po, PV). We also present the Misclassification Rate (MCR), defined as the proportion of safe responses incorrectly classified as unsafe.

	Method	MCR ↓	IA ↓	HS ↓	MG ↓	Fr ↓	Po ↓	PV ↓	Avg ↓
LLaVA-1.5	Vanilla	-	89.7	65.0	63.6	74.0	78.0	68.3	73.1
	ECSO	0	37.1	20.2	20.5	31.2	63.3	35.3	34.6
	FGSD (Ours)	0	15.3	26.2	17.9	15.6	21.8	18.9	19.3
InstructBLIP	Vanilla	-	69.1	44.1	45.5	43.5	43.1	49.6	49.2
	ECSO	14.6	-	-	-	-	-	-	-
	FGSD (Ours)	0	17.8	18.6	20.3	24.5	40.1	33.5	25.8
mPLUG-Owl2	Vanilla	-	94.8	81.6	81.8	85.7	75.2	88.5	84.6
	ECSO	0	22.7	28.2	38.6	24.0	69.7	86.3	44.9
	FGSD (Ours)	0	13.7	19.1	33.0	12.4	38.5	31.2	24.7

Table 4: Performance comparison between DSR and other baselines on LLaVA-1.5-7B across comprehensive benchmarks, general VQA, and hallucination benchmarks. The results in bold and underline are the best and second-best results, respectively.

	Comprehensive Benchmark					General VQA			Hallucination Benchmark
Method	MME ↑	SEED ↑	LLaVA^W ↑	MMB ↑	MM-Vet ↑	SQA^I ↑	VisWiz ↑	GQA ↑	POPE ↑	CHAIR_S ↓	CHAIR_I ↓
LLaVA-1.5 7B	1858.9	58.6	63.4	64.3	30.5	66.8	50.0	62.0	85.9	48.8	14.9
+ Silkie	1754.5	59.3	62.1	64.0	31.2	66.2	52.6	63.2	83.7	40.3	13.2
+ LLaVA-RLHF	1825.6	58.1	63.7	63.4	31.1	65.8	51.7	61.3	81.5	38.7	11.3
+ POVID	1778.1	60.2	65.8	64.9	31.8	68.8	53.6	61.7	86.9	35.2	8.3
+ RLHF-V	1838.6	60.1	65.4	63.6	30.9	67.1	54.2	62.1	86.2	29.7	7.5
+ RLAIF-V	-	-	-	-	-	-	-	-	-	21.2	4.7
+ CSR	1851.5	60.6	66.0	64.3	32.1	68.5	53.1	61.8	86.9	30.6	8.2
+ DSR (Ours)	1879.8	60.8	66.3	64.5	32.1	69.2	54.2	62.1	87.1	27.1	6.9

Table 5: Ablation study on scoring components. "w/o Self-Judgment" represents randomly selecting a sentence from the candidates, while "w/o Debiasing" indicates the removal of the Score Debiasing step.

Methods	CHAIR_S ↓	CHAIR_I ↓
w/o Self-Judgment	24.4	8.0
w/o Debiasing	19.0	6.2
DSGD	15.2	5.0

Table 6: Training hyperparameters.

Hyperparameters
lora_r	128
lora_alpha	256
lora_target	all
mm_projector_lr	2e-5
Batch size	1
Learning rate	1e-7
model_max_length	1024

Table 7: Efficiency measurement of DSGD and baselines on CHAIR 64 benchmark.

	Require finetuning	Require external tool	Only work for image captioning	Execution time(s)
Greedy	×	×	×	1.1
Beam Search	×	×	×	2.0
DoLA	×	×	×	10.5
VCD	×	×	✓	9.9
Opera	×	×	✓	12.5
POVID	✓	×	×	1.2
LURE	✓	×	✓	3.9
WoodPecker	×	✓	×	N/A
DSGD(Ours)	×	×	×	3.5

Table 8: Spearman’s rank correlation coefficients for self-judgment scores across models.

Model	Self-Judgment vs. FaithScore	Self-Judgment vs. Blind Self-Judgment
LLaVA-1.5-7B	0.673	0.273
InstructBLIP	0.629	0.371
mPLUG-Owl2	0.750	0.296

Table 9: Performance comparison of LLaVA-1.5-7B and DSR with different preference data scales across multiple benchmarks. The best results in each column are highlighted in bold.

Method	MME_P ↑	MME_C ↑	SEED ↑	LLaVA^W ↑	MMB ↑	MM-Vet ↑	SQA^I ↑	VisWiz ↑	GQA ↑	POPE ↑	CHAIR_S ↓	CHAIR_I ↓
LLaVA-1.5-7B	1510.7	348.2	58.6	63.4	64.3	30.5	66.8	50.0	62.0	85.9	48.8	14.9
6K Data	1500.6	379.2	60.8	66.3	64.5	32.1	69.2	54.2	62.1	87.1	27.1	6.9
10K Data	1508.2	380.5	61.3	66.7	64.7	32.7	69.5	55.0	62.1	87.1	25.8	6.1

Table 10: Performance comparison of VILA with different preference data curation methods on multiple benchmarks.

Method	MME ↑	SEED ↑	LLaVA^W ↑	MMB ↑	MM-Vet ↑	SQA^I ↑	VisWiz ↑	GQA ↑	POPE ↑	CHAIR_S ↓	CHAIR_I ↓
VILA	1849.4	61.1	69.7	68.9	34.9	68.2	57.8	62.3	85.50	31.0	8.8
+ CSR	1852.5	63.2	73.5	69.3	38.3	71.9	62.3	62.2	86.82	29.2	7.9
+ DSR	1875.5	63.2	73.9	69.7	38.4	72.4	61.0	62.5	86.96	28.5	7.4

Table 11: The effect of $\alpha$ on FGSD.

	Method	MCR ↓	IA ↓	HS ↓	MG ↓	Fr ↓	Po ↓	PV ↓	Avg ↓
LLaVA-1.5	w/o Defense	-	89.7	65.0	63.6	74.0	78.0	68.3	73.1
	$α = 1$	0	16.5	27.5	18.0	18.8	22.3	20.5	20.6
	$α = 0.1$	0	11.3	21.4	13.3	11.0	17.4	14.3	14.8

Table 12: Ablation study of Fine-Grained Self-Defense (FGSD) on MM-SafetyBench.

	Method	MCR ↓	IA ↓	HS ↓	MG ↓	Fr ↓	Po ↓	PV ↓	Avg ↓
LLaVA-1.5	w/o Defense	-	89.7	65.0	63.6	74.0	78.0	68.3	73.1
	w/o Debiasing	0	13.4	21.9	15.1	12.0	18.9	17.5	16.5
	FGSD (Ours)	0	11.3	21.4	13.3	11.0	17.4	14.3	14.8

Table 13: Ablation study of Debiased Self-Rewarding (DSR). ‘w/o Debiasing’ refers to the self-judgment scoring process without debiasing, while ‘w/o ILJ’ refers to the data construction process without instance-level judgment. Both settings are subsequently used for Direct Preference Optimization (DPO) training.

	Comprehensive Benchmark						General VQA			Hallucination Benchmark
Method	MME^P	MME^C	SEED	LLaVA^W	MMB	MM-Vet	SQA^I	VisWiz	GQA	POPE	CHAIR_S	CHAIR_I
LLaVA-1.5-7B	1510.7	348.2	58.6	63.4	64.3	30.5	66.8	50.0	62.0	85.9	48.8	14.9
w/o Debiasing	1495.3	370.0	60.6	65.8	64.3	32.0	69.3	54.0	61.7	86.7	30.2	9.1
w/o ILJ	1494.4	369.7	60.7	66.0	64.5	32.0	68.8	54.1	62.0	86.7	28.9	7.7
+ DSR (Ours)	1500.6	379.2	60.8	66.3	64.5	32.1	69.2	54.2	62.1	87.1	27.1	6.9

Table 14: Prompt for DSGD: utilizing the model itself as a judge to determine the faithfulness of the generated response.

Please analyze the provided image and determine if the following description is accurate.

The description is:{description}

“yes” or “no” or “maybe”.

Table 15: Prompt for FGSD: utilizing the model itself as a judge to determine the safety of the generated response.

Answer: {text}

Does the Answer contains harmful and unethical suggestions?

“yes” or “no” or “maybe”.

Table 16: Prompt for DSR for question answering-type data: utilizing the model itself as a judge to determine the accuracy of the answer.

Please analyze the provided image and determine if the answer to the question is correct.

Question: {qs}

Answer: {ans}

“yes” or “no” or “maybe”.

Table 17: Prompt for DSR for detailed description-type data: leveraging the model itself as a judge to assess the accuracy of the description.

Please analyze the provided image and determine if the answer to the question is correct.

The description is: {description}

“yes” or “no” or “maybe”.

Table 18: Prompt for instance-level self-judgment: utilizing the model itself as a judge to determine whether the answer to the question is correct.

Please analyze the provided image and determine if the answer to the question is correct.

Question: {qs}

Answer: {ans}

“yes” or “no” or “maybe”.

Equations

$$Score_{f} = \text{logit}_{\theta}(\text{cls} \mid \text{prompt}_{f}, v, a),$$

$$Score_{f}^{\prime} = \text{logit}_{\theta}(\text{cls} \mid \text{prompt}_{f}, a),$$

$$S_{f} = (1 + \alpha)\, Score_{f} - \alpha\, Score_{f}^{\prime}.$$

$$T = \frac{\left\lceil \max\{S_{u}(a_{1}), S_{u}(a_{2}), \dots, S_{u}(a_{n})\} \times 10 \right\rceil}{10},$$

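$$\mathcal{L} = -\mathbb{E}_{(x, y_{w}, y_{l}) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_{\theta}(y_{w} \mid x)}{\pi_{\text{ref}}(y_{w} \mid x)} - \beta \log \frac{\pi_{\theta}(y_{l} \mid x)}{\pi_{\text{ref}}(y_{l} \mid x)}\right)\right],$$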

$$\text{CHAIR}_{I} = \frac{|\{\text{hallucinated objects}\}|}{|\{\text{all mentioned objects}\}|},$$

$$\text{CHAIR}_{S} = \frac{|\{\text{captions with hallucinated objects}\}|}{|\{\text{all captions}\}|}.$$
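As a rough illustration of the two CHAIR ratios, the sketch below computes them for toy captions. It assumes each caption has already been reduced to a set of mentioned objects and a set of ground-truth objects; the real metric also maps words to MSCOCO object categories, which is omitted here, and the function name `chair` and the toy data are hypothetical. The tables report these ratios as percentages.

```python
from typing import List, Set, Tuple


def chair(captions: List[Tuple[Set[str], Set[str]]]) -> Tuple[float, float]:
    """Return (CHAIR_I, CHAIR_S) for (mentioned, ground_truth) object sets."""
    hallucinated = 0      # objects mentioned but absent from the image
    mentioned_total = 0   # all mentioned objects
    bad_captions = 0      # captions with at least one hallucinated object
    for mentioned, ground_truth in captions:
        extra = mentioned - ground_truth
        hallucinated += len(extra)
        mentioned_total += len(mentioned)
        bad_captions += 1 if extra else 0
    return hallucinated / mentioned_total, bad_captions / len(captions)


# Toy data: the first caption hallucinates "frisbee", the second is clean.
toy = [
    ({"dog", "couch", "frisbee"}, {"dog", "couch"}),
    ({"cat"}, {"cat", "table"}),
]
chair_i, chair_s = chair(toy)
print(chair_i, chair_s)
```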


Full text

Improving Alignment in LVLMs with Debiased Self-Judgment

Sihan Yang1∗ Chenhang Cui2∗ Zihao Zhao1 Yiyang Zhou3 Weilong Yan2

Ying Wei1 Huaxiu Yao3

1 Nanyang Technological University 2 National University of Singapore 3 UNC-Chapel Hill

[email protected] [email protected]

Abstract

The rapid advancements in Large Language Models (LLMs) and Large Visual-Language Models (LVLMs) have opened up new opportunities for integrating visual and linguistic modalities. However, effectively aligning these modalities remains challenging, often leading to hallucinations (where generated outputs are not grounded in the visual input) and raising safety concerns across various domains. Existing alignment methods, such as instruction tuning and preference tuning, often rely on external datasets, human annotations, or complex post-processing, which limit scalability and increase costs. To address these challenges, we propose a novel approach that generates the debiased self-judgment score, a self-evaluation metric created internally by the model without relying on external resources. This enables the model to autonomously improve alignment. Our method enhances both decoding strategies and preference tuning processes, resulting in reduced hallucinations, enhanced safety, and improved overall capability. Empirical results show that our approach significantly outperforms traditional methods, offering a more effective solution for aligning LVLMs. Code is at https://github.com/sihany077/LVLM_Debiased_Self_Judge.


1 Introduction

Owing to the powerful capabilities of Large Language Models (LLMs) (Bai et al., 2023; Touvron et al., 2023; Chiang et al., 2023), Large Visual-Language Models (LVLMs) demonstrate impressive performance by effectively integrating visual inputs into the latent representation space of LLMs (Liu et al., 2023c; Ye et al., 2023a; Zhu et al., 2023). However, similar to LLMs, LVLMs face inherent alignment challenges, including hallucinations (where the generated content is not grounded in the image) (Li et al., 2023d; Liu et al., 2023a), and safety issues (Liu et al., 2024a; Pi et al., 2024), which negatively impact the application of LVLMs across various domains (Li et al., 2024; Liu et al., 2024b; Zhang et al., 2024).

To address misalignment in LVLMs, a growing body of recent research has explored enhancing model alignment by leveraging external tools or human annotations to assist with preference tuning (Yu et al., 2024b; Wang et al., 2024; Yu et al., 2024a) and inference (Yin et al., 2023; Lee et al., 2024). However, most prevailing approaches rely heavily on powerful external resources—such as advanced models like GPT (Achiam et al., 2023) or human experts—which can lead to substantial costs during both training and inference. Moreover, in a hypothetical future where an AI system requiring alignment surpasses both human intelligence and the capabilities of other models, supervision from humans or existing models may offer only limited effectiveness for such a superintelligent system.

In response to these challenges, we draw inspiration from the effective self-reflection abilities observed in LLMs (Kadavath et al., 2022) and explore how LVLMs can self-evaluate and enhance their alignment independently. We observe that the internal confidence of LVLMs can reflect the faithfulness of their output sentences, but it also incorporates significant textual priors. Building on this, we introduce the debiased self-judgment score, a sentence-level evaluation metric generated autonomously by the model without relying on external resources. This score is applied to both decoding and preference tuning. Our results show that this approach significantly enhances LVLMs’ performance, improving faithfulness, safety, and overall capability, as shown in Figure 1. In summary, our contributions are three-fold:

• We demonstrate that leveraging LVLM’s intrinsic confidence as a self-judgment score is effective, but it is influenced by strong textual priors. To address this, we propose a debiasing method for the self-judgment score.

• The debiased self-judgment score is used to guide decoding, resulting in more faithful and safer outputs. It is also applied to self-improvement training, improving model performance across multiple dimensions.

• Experiments on hallucination, safety, and comprehensive benchmarks across different LVLMs validate our method’s effectiveness.

2 Related Work

2.1 Alignment in LVLMs

LVLMs demonstrate exceptional performance across a range of tasks (Liu et al., 2024b; Li et al., 2024; Zhang et al., 2024). However, they remain vulnerable to misalignment issues, which can lead to significant challenges such as safety concerns and hallucinations. To mitigate hallucinations, several methods have been proposed, including instruction tuning (Liu et al., 2023a), decoding strategies (Leng et al., 2024; Huang et al., 2024; Park et al., 2024; Chen et al., 2024b), preference fine-tuning (Sun et al., 2023; Yu et al., 2023a), and improved vision encoders (Jain et al., 2024). To tackle safety challenges, researchers have employed strategies such as fine-tuning for safety (Chen et al., 2024a; Pi et al., 2024), adopting robust architectures (Hossain and Imteaj, 2024), and evaluating responses with the assistance of other models (Ding et al., 2024). Despite these advancements, most existing methods rely on external models or tools, limiting scalability and introducing potential biases. In contrast, our approach leverages internal model capabilities to generate more faithful, safe responses and improve overall LVLM performance, without external resources.

2.2 Judgment in LLMs and LVLMs

The LLM-as-a-Judge (Zheng et al., 2023) paradigm has become a widely adopted method for evaluating the quality of outputs from large language models (Wang et al., 2023; Yuan et al., 2024; Chan et al., 2023). This approach typically involves using one language model to assess the outputs of another (Kim et al., 2023; Chan et al., 2023; Chang et al., 2024), providing a scalable alternative to traditional human evaluation. Beyond language models, LVLM judges have also been widely applied for various purposes, such as evaluating LVLM performance (Xiong et al., 2024; Jing et al., 2023), correcting unfaithful outputs during inference (Lee et al., 2024), and generating preference data to improve the overall performance of LVLMs (Wang et al., 2024). However, these methods often rely on powerful models (e.g., RLAIF-V (Yu et al., 2024b)), additional training of the judge model (e.g., Volcano (Lee et al., 2024), LLaVA-Critic (Xiong et al., 2024)), or human annotations (e.g., SIMA (Wang et al., 2024)), which limit scalability and introduce additional costs. In contrast, our proposed approach harnesses the models’ intrinsic confidence to accurately assess LVLMs’ outputs. This shows the potential of LVLMs’ self-judgment for inference and preference data generation, without external models or human annotation.

3 Preliminary Observations

In this section, we present preliminary findings on the potential and limitations of LVLMs’ self-judgment abilities, which serve as the foundation for our proposed debiased self-judgment score.

3.1 Potential of LVLMs for Self-Judgment

Previous research (Kadavath et al., 2022; Phute et al., 2023) shows that LLMs can sometimes evaluate the accuracy of their own responses, offering a scalable way to assess model outputs. Inspired by this, we explore whether LVLMs can self-evaluate to improve alignment and output quality. Specifically, we focus on faithfulness—the correspondence between image descriptions and visual content—as it is a key aspect of alignment in LVLMs. We use LLaVA-1.5 7B (Liu et al., 2023b) to generate one description for each of 500 randomly selected images from the MSCOCO dataset (Lin et al., 2014). To objectively measure the faithfulness of these descriptions, we calculate the FaithScore (Jing et al., 2023), defined as the proportion of correct atomic facts to total atomic facts in a description (a score closer to 1 indicates higher faithfulness). To enable the LVLM to self-assess description faithfulness, we use the prompt “Is the description accurate?” and extract the logit for the “Yes” response as the self-judgment score. The correlation between self-judgment scores and FaithScores is illustrated in Figure 2 (Top).

The figure shows a positive correlation between self-judgment scores and FaithScores, indicating higher confidence often corresponds to more accurate descriptions. However, the moderate correlation suggests that self-judgment alone may not fully capture faithfulness, requiring further refinement.

3.2 LVLMs’ Limitations in Self-Judgment

LVLMs build on the advanced text-generation capabilities of LLMs to create multimodal frameworks, yet they inherit unimodal biases from these language models. For example, prior research (Leng et al., 2024; Han et al., 2022; Li et al., 2023d) indicates that LVLMs tend to overlook image content and overly rely on text-based priors when generating descriptions.

We further investigate whether these unimodal biases affect the LVLMs’ ability to assess the faithfulness of their outputs. Specifically, we reuse the 500 image descriptions and their corresponding self-judgment scores obtained in Section 3.1. To isolate the model’s text-based priors, we remove the images and have the same LVLM evaluate the faithfulness of the sentences using the self-judgment method described in Section 3.1. This generates scores (referred to as blind self-judgment scores) that represent the model’s text-based priors.

As shown in Figure 2 (Bottom), the moderate positive correlation between the LVLM’s self-judgment scores and the blind self-judgment scores suggests that the model’s self-judgment is biased toward the textual modality, rather than reflecting true multimodal faithfulness. Quantitative analyses on more models are provided in Appendix C.1.
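The correlations discussed in this section can be quantified with Spearman's rank correlation, the statistic named in the caption of Table 8. The following is a minimal pure-Python sketch; the helper names and the five toy score pairs are illustrative assumptions, not values from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: hypothetical self-judgment logits and FaithScores for 5 captions.
self_judgment = [2.1, 0.3, 1.5, 3.0, 0.9]
faith_scores = [0.80, 0.40, 0.90, 0.95, 0.50]
rho = spearman(self_judgment, faith_scores)
print(round(rho, 3))
```

The same function applies unchanged to the comparison between self-judgment scores and blind self-judgment scores.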

4 Method

In this section, we propose a method that leverages the model’s internal confidence for self-judgment and eliminates text modality bias, resulting in a debiased self-judgment score. This score is used for decoding and preference tuning to enhance LVLMs’ faithfulness, safety, and overall capability. Specifically, Section 4.1 describes how to derive the debiased self-judgment score and apply it to generate more faithful descriptions; Section 4.2 incorporates the score with a safety prefix to prevent unsafe outputs; and Section 4.3 investigates how both sentence-level and instance-level self-judgment contribute to self-improvement training.

4.1 Deriving the Debiased Self-Judgment Score and Its Application in Decoding for Faithfulness

In this section, using faithfulness evaluation as an example, we introduce a method that leverages the model’s internal confidence to perform self-judgment and mitigate text modality bias, resulting in the debiased self-judgment score. This score is then applied in the decoding process through Debiased Self-Guided Decoding (DSGD) to prioritize visually grounded content and enhance faithfulness. The process is divided into three main components (shown in Figure 3: Top):

Self-Judgment Scoring. By leveraging the intrinsic confidence of LVLMs, we have the model self-judge its own outputs at the sentence level for factual accuracy. For a sentence $a$ generated by the LVLM, we use a prompt $\text{prompt}_{f}$, such as “Is the description accurate?”, to guide the LVLM in evaluating the faithfulness of sentence $a$ based on the image $v$. We compute the initial faithfulness score, $Score_{f}$, as the sum of the logits for the tokens “Yes” and “yes” from the LVLM’s next-token predictions:

$$Score_{f} = \text{logit}_{\theta}(\text{cls} \mid \text{prompt}_{f}, v, a),$$

where cls represents the tokens “Yes” and “yes”.

Score Debiasing. As our observations in Section 3.2 reveal, LVLMs inherit a bias toward text from Large Language Models, which can lead to inaccurate judgment of their own generated sentences in certain cases. To mitigate this text bias in $Score_{f}$, we introduce a score debiasing process, as illustrated in Figure 4. Specifically, we first feed the judging prompt and the sentence being judged to the LVLM without the image to obtain logits $l^{\prime}$, which contain only text priors. Then, using the same method as Self-Judgment Scoring, we compute $Score_{f}^{\prime}$ as follows:

$$Score_{f}^{\prime} = \text{logit}_{\theta}(\text{cls} \mid \text{prompt}_{f}, a),$$

where cls represents “Yes” and “yes”. Finally, to reduce the influence of text modality bias, we employ a contrastive objective to obtain the final faithfulness score:

$$S_{f} = (1 + \alpha)\, Score_{f} - \alpha\, Score_{f}^{\prime}.$$
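To make the contrastive score $S_{f} = (1 + \alpha)\, Score_{f} - \alpha\, Score_{f}^{\prime}$ concrete, here is a minimal sketch that assumes the next-token logits have already been extracted into plain dictionaries keyed by token string. The function names and the logit values are hypothetical; how the logits are obtained from a particular LVLM is not shown.

```python
from typing import Mapping, Sequence

CLS_TOKENS = ("Yes", "yes")  # the affirmative tokens named in the text


def yes_score(logits: Mapping[str, float],
              cls_tokens: Sequence[str] = CLS_TOKENS) -> float:
    """Sum the next-token logits of the affirmative tokens."""
    return sum(logits[token] for token in cls_tokens)


def debiased_score(logits_with_image: Mapping[str, float],
                   logits_text_only: Mapping[str, float],
                   alpha: float = 1.0) -> float:
    """Contrast the image-conditioned score against the text-only prior."""
    score = yes_score(logits_with_image)      # Score_f
    score_blind = yes_score(logits_text_only)  # Score_f'
    return (1 + alpha) * score - alpha * score_blind


# Toy logits (made up): the image raises confidence beyond the text prior.
with_image = {"Yes": 2.0, "yes": 1.0, "No": 0.5}
text_only = {"Yes": 1.5, "yes": 0.5, "No": 1.0}
s_f = debiased_score(with_image, text_only, alpha=1.0)
print(s_f)  # (1 + 1) * 3.0 - 1 * 2.0
```

With $\alpha = 0$ the score reduces to the plain self-judgment score; larger $\alpha$ penalizes sentences whose confidence comes mostly from the text prior.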

Guided Sentence Generation. In this approach, the generation process is guided by the debiased self-judgment scores to maintain alignment between the generated descriptions and the visual content. We adopt a sentence-by-sentence generation strategy, using debiased self-judgment scores to select each sentence in order to maintain fluency and faithfulness to the image. To minimize the cost of inference, we employ a greedy search strategy for sentence selection. At each step $t$, given the partially generated description $c_{t}=(a_{1},a_{2},\dots,a_{t})$, the model generates $N$ candidate sentences $\{a_{t+1}^{1},a_{t+1}^{2},\dots,a_{t+1}^{N}\}$ for the next sentence $a_{t+1}$. The candidate with the highest faithfulness score $S_{f}$ is selected as $a_{t+1}$ and appended to $c_{t}$. This process continues until an EOS token is reached.
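The greedy sentence selection described above could be sketched as follows. The model is abstracted behind two hypothetical callables: `propose` returns candidate next sentences for a partial description, and `score` returns the debiased self-judgment score of a candidate. The `<eos>` marker, the `max_sentences` cap, and the toy stand-ins are assumptions made for illustration.

```python
from typing import Callable, List

EOS = "<eos>"  # hypothetical end-of-sequence marker on a candidate sentence


def guided_decode(propose: Callable[[List[str]], List[str]],
                  score: Callable[[List[str], str], float],
                  max_sentences: int = 8) -> List[str]:
    """Greedily append the highest-scoring candidate sentence at each step."""
    chosen: List[str] = []
    for _ in range(max_sentences):
        candidates = propose(chosen)
        if not candidates:
            break
        best = max(candidates, key=lambda s: score(chosen, s))
        finished = best.endswith(EOS)
        chosen.append(best[: -len(EOS)].strip() if finished else best)
        if finished:  # the selected sentence ended the description
            break
    return chosen


# Toy stand-ins for the LVLM: fixed candidates and made-up scores.
STEPS = [
    ["A dog sits on a couch.", "A cat sits on a couch."],
    ["It wears a red collar.<eos>", "A man stands nearby.<eos>"],
]
TOY_SCORES = {
    "A dog sits on a couch.": 3.2,
    "A cat sits on a couch.": 0.4,
    "It wears a red collar.<eos>": 2.1,
    "A man stands nearby.<eos>": -1.0,
}


def toy_propose(prefix: List[str]) -> List[str]:
    return STEPS[len(prefix)] if len(prefix) < len(STEPS) else []


def toy_score(prefix: List[str], sentence: str) -> float:
    return TOY_SCORES[sentence]


result = guided_decode(toy_propose, toy_score)
print(" ".join(result))
```

Because only one sentence is kept per step, the cost grows linearly in the number of sentences times $N$ candidates, rather than exponentially as in a full beam over sentences.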

4.2 Self-Defense for Safety

This section presents an application of the debiased self-judgment score for detecting and moderating unsafe content in LVLMs’ responses, utilizing a fine-grained unsafe score and a safety prefix. This process (shown in Figure 3: Bottom), referred to as Fine-Grained Self-Defense (FGSD), is composed of three key components:

Unsafety Scoring. To evaluate the safety of LVLMs’ responses more precisely, we adopt a sentence-level judgment and leverage the LVLM’s intrinsic ability for self-judgment to achieve higher accuracy while maintaining the model’s utility as much as possible. This section follows the methodology described in Section 4.1 to obtain the unsafe score. The Unsafety Scoring process uses a prompt, such as “Is the response unsafe?”, to calculate the unsafe score $S_{u}$ . Notably, in some cases, LVLMs cannot determine the safety of a response without visual input, as the assessment relies heavily on image context (see example in Appendix C.6), highlighting the need to mitigate text bias.

Unsafety Threshold Setting. When using the unsafe score $S_{u}$ to assess the safety of a sentence, it is important to set an appropriate threshold to distinguish between safe and unsafe sentences. This helps reduce unsafe outputs while maintaining the model’s utility. We first generate 1,000 safe responses using prompts from the model’s training dataset (or from the LLaVA-1.5 training set if the model’s training dataset is not publicly available). These safe responses are then scored at the sentence level using the method described in Unsafety Scoring. The final threshold is set as the maximum unsafe score observed among all verified safe sentences, rounded up to one decimal place. This adjustment provides a margin to prevent the model’s safe outputs from being misclassified as unsafe. The threshold can be formulated as follows:

$$T = \frac{\left\lceil \max\{S_{u}(a_{1}), S_{u}(a_{2}), \dots, S_{u}(a_{n})\} \times 10 \right\rceil}{10},$$

where $a_{1},a_{2},\dots,a_{n}$ represent the sentences generated as safe responses from prompts sampled from general datasets. Here, $\lceil\cdot\rceil$ represents the ceiling function, which rounds a number up to the smallest integer greater than or equal to its value.
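The threshold rule (take the maximum unsafe score over the safe sentences, then round up to one decimal place) is a one-liner in practice. The sketch below uses made-up unsafe scores; the function name is hypothetical.

```python
import math
from typing import Iterable


def unsafety_threshold(safe_sentence_scores: Iterable[float]) -> float:
    """T = ceil(max(scores) * 10) / 10, i.e. round the maximum up to 0.1."""
    return math.ceil(max(safe_sentence_scores) * 10) / 10


# Toy unsafe scores for sentences known to be safe (illustrative values).
threshold = unsafety_threshold([0.12, 0.47, 0.31])
print(threshold)

# Scores are logit-based, so they may be negative; the ceiling still rounds up.
negative_threshold = unsafety_threshold([-2.34, -0.81])
print(negative_threshold)
```

One caveat with this sketch: for a maximum lying exactly on a 0.1 boundary, binary floating-point error in the multiplication by 10 can push the ceiling one step higher, so a decimal type may be preferable if exact boundaries matter.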

Unsafe Score-Guided Response Moderation. A sentence is considered to contain unsafe content if its unsafe score exceeds the threshold $T$. Upon detecting an unsafe output, the response is replaced by the prefix "Sorry, answering the question will generate harmful content, because". This prefix, together with the original prompt, is then provided back to the LVLM, prompting it to generate the subsequent tokens. Leveraging its autoregressive architecture, the LVLM is able to autonomously produce a coherent explanation for the refusal.
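A minimal sketch of this moderation step is given below. The refusal prefix is taken from the text; `unsafe_score` and `continue_from` are hypothetical callables standing in for the LVLM's sentence-level unsafe scoring and its continuation of a forced prefix, and the toy scores are made up.

```python
from typing import Callable, List

REFUSAL_PREFIX = ("Sorry, answering the question will generate "
                  "harmful content, because")


def moderate(prompt: str,
             sentences: List[str],
             unsafe_score: Callable[[str], float],
             threshold: float,
             continue_from: Callable[[str, str], str]) -> str:
    """Return the response unchanged, or a refusal the model completes itself."""
    if any(unsafe_score(s) > threshold for s in sentences):
        # Feed prompt + prefix back so the model explains its own refusal.
        return f"{REFUSAL_PREFIX} {continue_from(prompt, REFUSAL_PREFIX)}"
    return " ".join(sentences)


# Toy stand-ins for the model (illustrative only).
toy_scores = {"Here is a recipe.": 0.1, "First, pick the lock.": 0.9}
toy_continue = lambda prompt, prefix: "it would explain how to break in."

safe_out = moderate("q", ["Here is a recipe."], toy_scores.get, 0.5, toy_continue)
unsafe_out = moderate("q", ["Here is a recipe.", "First, pick the lock."],
                      toy_scores.get, 0.5, toy_continue)
print(safe_out)
print(unsafe_out)
```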

4.3 Dual Self-Judgment for More Significant Self-Improvement

In this section, we present a self-rewarding training paradigm for LVLMs, referred to as Debiased Self-Rewarding (DSR). We propose a dual self-judgment mechanism for preference tuning (shown in Figure 5), which includes: (1) using the debiased self-judgment score as a reward signal for sentence-level preference data generation, and (2) refining instance-level preference data quality through self-judgment. This mechanism generates high-quality preference data, which is used to fine-tune the LVLM via Direct Preference Optimization (Rafailov et al., 2024) to achieve self-improvement. The method is described as follows:

Preference Data Generation. We generate two types of preference data for training: question answering and detailed description. Similar to the setup in Section 4.1, at each step, the sentence with the highest debiased self-judgment score is selected as the preferred response, and the sentence with the lowest score as the dispreferred response. The process continues by generating new sentence candidates based on the selected sentences until the EOS token is reached.

Data Cleaning. We notice that the preferred data contains incorrect responses, while the dispreferred data includes correct ones, which could undermine the model’s performance during training. To resolve this, we use the same LVLM to evaluate the correctness of responses at the instance level. If the LVLM outputs “Yes”, the response is considered correct; otherwise, it is deemed incorrect. Consequently, incorrect responses in the preferred data and correct responses in the dispreferred data are removed. The final preference data is defined as: $\mathcal{D}=\left\{\left(x^{(i)},y_{w}^{(i)},y_{l}^{(i)}\right)\right\}_{i=1}^{N}$, where $y_{w}^{(i)}$ and $y_{l}^{(i)}$ denote the preferred and dispreferred responses for the input prompt $x^{(i)}$.
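The two stages above (pair selection by score, then instance-level cleaning) might look like the following sketch. It assumes that a pair is dropped whenever either side fails the instance-level check, which is one reading of the removal rule; `score` and `judged_correct` are hypothetical stand-ins for the LVLM's debiased score and its “Yes”/other verdict.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str, str]  # (prompt x, preferred y_w, dispreferred y_l)


def pick_pair(candidates: List[str],
              score: Callable[[str], float]) -> Tuple[str, str]:
    """Highest-scoring candidate is preferred, lowest-scoring dispreferred."""
    ranked = sorted(candidates, key=score)
    return ranked[-1], ranked[0]


def clean(pairs: List[Pair],
          judged_correct: Callable[[str, str], bool]) -> List[Pair]:
    """Keep pairs whose preferred side is judged correct and whose
    dispreferred side is judged incorrect (assumed reading of the rule)."""
    return [(x, y_w, y_l) for x, y_w, y_l in pairs
            if judged_correct(x, y_w) and not judged_correct(x, y_l)]


# Toy example with made-up scores and verdicts.
toy_scores = {"Two dogs.": 2.5, "Three dogs.": 0.2, "A dog and a cat.": 1.1}
y_w, y_l = pick_pair(list(toy_scores), toy_scores.get)

verdicts = {("q1", "Two dogs."): True, ("q1", "Three dogs."): False,
            ("q2", "Red."): False, ("q2", "Blue."): False}
pairs = [("q1", y_w, y_l), ("q2", "Red.", "Blue.")]
cleaned = clean(pairs, lambda x, y: verdicts[(x, y)])
print(cleaned)
```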

Preference Tuning. After obtaining the cleaned preference data, we fine-tune the target LVLM using DPO. The loss of DPO is defined as:

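$$\mathcal{L} = -\mathbb{E}_{(x, y_{w}, y_{l}) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_{\theta}(y_{w} \mid x)}{\pi_{\text{ref}}(y_{w} \mid x)} - \beta \log \frac{\pi_{\theta}(y_{l} \mid x)}{\pi_{\text{ref}}(y_{l} \mid x)}\right)\right],$$

where the model policy $\pi_{\theta}$ is initialized from the base reference policy $\pi_{\text{ref}}$, $\beta$ is a parameter controlling the deviation from $\pi_{\text{ref}}$, and $\sigma$ denotes the logistic function.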
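For intuition, the DPO loss of a single preference pair can be evaluated directly from sequence log-probabilities. The sketch below is framework-free and uses made-up log-probabilities; the default `beta=0.1` is an assumed value, since this section does not state the one used in training.

```python
import math


def dpo_pair_loss(logp_w: float, logp_l: float,
                  ref_logp_w: float, ref_logp_l: float,
                  beta: float = 0.1) -> float:
    """-log sigmoid(beta * [(log-ratio of y_w) - (log-ratio of y_l)])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) == log(1 + exp(-m))
    return math.log1p(math.exp(-margin))


# At initialization the policy equals the reference, so the loss is log 2.
init_loss = dpo_pair_loss(-5.0, -5.0, -5.0, -5.0)

# Raising the preferred response's likelihood relative to the reference
# (and lowering the dispreferred one's) reduces the loss.
better_loss = dpo_pair_loss(-4.0, -6.0, -5.0, -5.0)
print(init_loss, better_loss)
```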

5 Experiments

In this section, we evaluate the performance of the proposed debiased self-judgment score across various applications, aiming to answer the following questions: (1) Can DSGD effectively reduce hallucinations in LVLMs compared to other baselines? (2) Can FGSD reduce unsafe outputs while maintaining the utility of LVLMs? (3) Can DSR effectively enhance the comprehensive capabilities of LVLMs? (4) Are the self-judgment method and the debiasing method we designed truly effective?

5.1 Enhancing Faithfulness through DSGD

Experimental Settings. We evaluate our method’s performance on object hallucination using the CHAIR (Rohrbach et al., 2018) metric on the MSCOCO (Lin et al., 2014) dataset, while BLEU (Papineni et al., 2002) is used to assess overall generation quality. FaithScore (Jing et al., 2023) measures hallucinations involving objects, attributes, and relationships. For hallucination mitigation during inference, we test six methods: Dola (Chuang et al., 2023), VCD (Leng et al., 2024), Opera (Huang et al., 2024), LURE (Zhou et al., 2023), Woodpecker (Yin et al., 2023), and HALC (Chen et al., 2024b), along with two conventional decoding strategies, greedy decoding and beam search. The experiments are conducted on LLaVA-1.5 (Liu et al., 2023b), InstructBLIP (Dai et al., 2023), and mPLUG-Owl2 (Ye et al., 2023b). For DSGD, we set $num\_beams$ to $5$ and $\alpha$ to $1$. The self-judgment prompt is detailed in Appendix D, while baseline and benchmark details are in Appendix A.2 and Appendix A.3.

Results. The primary experimental results are summarized in Table 1. Our proposed DSGD method achieves state-of-the-art performance in hallucination mitigation during inference, significantly reducing object hallucinations with notable decreases in CHAIR scores (31.33% for LLaVA-1.5, 42.42% for InstructBLIP, and 47.63% for mPLUG-Owl2). In addition, DSGD improves BLEU scores, reflecting an overall improvement in captioning quality. Table 2 further reinforces these findings, showing that DSGD surpasses other methods across a comprehensive evaluation of hallucinations, including objects, attributes, and relationships. DSGD consistently delivers the best results on both FaithScore and Sentence-level FaithScore, underscoring its robustness in ensuring caption faithfulness.

5.2 Ensuring Safety via FGSD

Experimental Settings. To measure safety performance, we follow previous works by utilizing commonly employed subsets of the MM-SafetyBench (Liu et al., 2023d). To assess whether a method preserves the model’s original utility, we generate 1,000 safe responses using prompts from general datasets (see Appendix A.1.2 for details) and calculate the proportion of safe responses incorrectly classified as unsafe, reported as the Misclassification Rate (MCR). We use the same three LVLMs as described in the previous section. ECSO (Gou et al., 2024) is chosen as the baseline, due to its enhanced safety during the inference phase. For FGSD, we set $\alpha$ to $0.1$ for all experiments. The prompt used for judging safety is provided in Appendix D.

Results. The results in Table 3 show that FGSD consistently outperforms baseline methods across three models (LLaVA-1.5, InstructBLIP, and mPLUG-Owl2) on MM-SafetyBench. FGSD achieves a significantly lower attack success rate (ASR) compared to the baseline without defense, reducing ASR by 73.6% for LLaVA-1.5, 47.6% for InstructBLIP, and 70.8% for mPLUG-Owl2, highlighting substantial safety improvement across these models. Although ECSO improves safety relative to no defense, it is less effective than FGSD. For InstructBLIP, ECSO reports a high misclassification rate (MCR) of 14.6%, meaning that many safe outputs are incorrectly flagged as unsafe, which reduces the model’s practical utility. In contrast, FGSD achieves zero MCR across all models, maintaining both safety and utility without compromising output accuracy. These findings underscore FGSD’s ability to enhance the safety of LVLMs during inference without the loss of utility observed with ECSO.
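The MCR described above is a simple proportion, and a minimal sketch may make the reported figure concrete. The function name `misclassification_rate` and the list of boolean flags are illustrative assumptions, not part of the paper’s code; the example only reuses the numbers stated in the text (1,000 safe responses, 14.6% MCR for ECSO on InstructBLIP), which would correspond to 146 flagged responses.

```python
def misclassification_rate(flagged_unsafe):
    """Fraction of known-safe responses that a defense flagged as unsafe.

    `flagged_unsafe` is a hypothetical list of booleans, one per safe
    response, where True means the defense marked it as unsafe.
    """
    flags = list(flagged_unsafe)
    if not flags:
        raise ValueError("need at least one response")
    return sum(1 for flag in flags if flag) / len(flags)


# Illustrative data only: 1,000 safe responses, 146 of them flagged.
flags = [True] * 146 + [False] * 854
print(f"MCR = {misclassification_rate(flags):.1%}")  # MCR = 14.6%
```

A defense with zero MCR, as reported for FGSD, would correspond to a list containing no `True` entries.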

5.3 Improving Overall Capability with DSR

Experimental Settings. To evaluate DSR in improving LVLM capability, we conduct experiments on three types of benchmarks: comprehensive benchmarks (MME (Fu et al., 2023), SEED-Bench (Li et al., 2023a), LLaVAW (Liu et al., 2023c), MMBench (Liu et al., 2024e), MM-Vet (Yu et al., 2024c)), general VQA tasks (ScienceQA (Lu et al., 2022), VizWiz (Gurari et al., 2018), GQA (Hudson and Manning, 2019)), and hallucination benchmarks (POPE (Li et al., 2023d), CHAIR (Rohrbach et al., 2018)). We utilize LLaVA-1.5 7B as the backbone model. For comparison, DSR is benchmarked against several data-driven preference learning methods, including Silkie (Li et al., 2023c), LLaVA-RLHF (Sun et al., 2023), POVID (Zhou et al., 2024a), RLHF-V (Yu et al., 2024a), and CSR (Zhou et al., 2024b). For DSR, we set $num\_beams$ to $5$ and $\alpha$ to $1$ . More implementation details are provided in Appendix A.1.3.

Results. As shown in Table 4, DSR significantly outperforms existing preference data curation methods that rely on external resources by delivering a more accurate reward signal through debiased self-judgment. Similarly, CSR, which depends on CLIP to compute text-image similarity and employs a computationally expensive beam search algorithm, is also outperformed. Despite these methods leveraging powerful external resources, DSR achieves superior results to all baselines using only 6k training data, the smallest dataset among all methods. In comparison, CSR uses 13k data points, further underscoring the high quality of the data generated by DSR and the effectiveness of debiased self-judgment. To ensure a fair comparison and verify the effectiveness of DSR, we restrict our training data to no more than that used by any baseline in the experiments corresponding to Table 4. To further verify the scalability of DSR, additional results corresponding to training with larger-scale data are provided in Appendix C.2. To verify the generalizability of DSR, we apply it to a more advanced model, VILA (Lin et al., 2024), with detailed results provided in Appendix C.3.

5.4 Ablation studies

We conduct an ablation study to assess the impact of Self-Judgment and Score Debiasing on hallucination rates, as measured by CHAIR ${}_{\text{S}}$ and CHAIR ${}_{\text{I}}$ , within our proposed Debiased Self-Guided Decoding (DSGD) method. The results, summarized in Table 5, indicate that when Self-Judgment is removed and candidates are selected randomly instead of guided by the debiased self-judgment score, hallucination rates increase significantly. Similarly, when the Score Debiasing step is removed, which results in a higher reliance on text priors during the self-judgment process, the hallucination rates also rise. In contrast, the full DSGD approach, which integrates both Self-Judgment and Score Debiasing, achieves the lowest hallucination rates. These findings demonstrate the effectiveness of both components in ensuring more faithful image-grounded content generation. Further ablation studies on the effects of hyper-parameters in DSGD, along with the corresponding ablation results for FGSD and DSR, can be found in Appendix C.4 and Appendix C.5.

6 Conclusion

In this paper, we propose a novel self-alignment method to solve the alignment problems in Large Vision-Language Models. By using a debiased self-judgment score, our approach enables the model to improve its vision-language alignment on its own, eliminating the need for external data or human intervention. Our extensive experiments demonstrate that this method reduces hallucinations and makes LVLMs safer and more powerful. The promising experimental results of our method indicate that self-judgment has considerable potential for enhancing alignment in LVLMs.

Limitations

In this work, we propose a debiased self-judgment score that guides both the decoding process and self-improvement training, enhancing the faithfulness and safety of LVLMs’ outputs, while also driving comprehensive improvements in their overall capabilities. However, our work still has limitations. Firstly, our method relies on accessing the model’s predicted token logits, which are often inaccessible in many closed-source models. This restricts its applicability to more powerful LLMs, such as GPT-4, which do not provide token likelihoods. Secondly, due to computational limitations, we only experimented with common LVLMs; future work should include experiments on a broader range of models to further validate the effectiveness and generalizability of our approach. Thirdly, in the jailbreak attack experiments, we conducted tests solely in English, so we cannot guarantee the effectiveness of our method for other languages.

Ethical Considerations

In this work, we present a novel approach to improving the alignment of Large Visual-Language Models (LVLMs) using a debiased self-judgment score. While our method enhances faithfulness, safety, and overall performance, it is essential to address the ethical implications of our research to ensure the responsible development and deployment of LVLMs.

Mitigating Harmful Outputs A primary objective of our approach is to enhance the safety of LVLMs by reducing hallucinations and ensuring that generated outputs are grounded in visual inputs. This reduces the likelihood of disseminating inaccurate or misleading information. Furthermore, the Fine-Grained Self-Defense (FGSD) mechanism is specifically designed to detect and moderate unsafe content, thereby minimizing the risk of generating harmful, unethical, or illegal outputs.

However, despite these advancements, there are scenarios where the model may fail to identify or mitigate unsafe outputs, particularly in cases involving nuanced ethical dilemmas or adversarial attacks. Strengthening the robustness of safety mechanisms across diverse and complex scenarios remains an ongoing challenge that requires further exploration.

Bias and Fairness While our debiasing techniques address text modality bias in LVLMs, other forms of bias inherent in the training data or model architecture may still persist. These biases could result in unintended consequences, such as reinforcing stereotypes or generating outputs that disproportionately impact certain groups. Future research should focus on identifying, evaluating, and mitigating broader societal biases in both the data and model architectures to ensure fair and equitable behavior in LVLMs across various contexts.

Human Oversight and Accountability Our method reduces reliance on external datasets, human annotations, and judgment models, which improves scalability and efficiency. However, this raises concerns about the potential lack of human oversight. While the self-judgment capabilities of the model show promise, they may not always align with human ethical standards, especially in sensitive or high-stakes applications.

We believe that human oversight and intervention should remain integral to the deployment of LVLMs, particularly in critical domains such as healthcare, law, and education. Ensuring alignment with human ethical principles and maintaining accountability throughout the lifecycle of these systems is essential for their safe and responsible use.

Acknowledgements

We sincerely appreciate the reviewers and the AC for their valuable suggestions throughout the review process.

Appendix A Experimental Details

A.1 Implementation Details

A.1.1 Enhancing Faithfulness through DSGD

Sentence-Level Beam Search. We set the parameters as follows to balance both diversity and quality in the sampled data. The num_beams parameter is set to 5. Additionally, the num_token_beams is also configured to 5, ensuring that 5 token-level search results are returned per beam search. The eos_token_id is set to the token corresponding to a period (.), enabling sentence-by-sentence control of the generation process. Finally, $\alpha$ is set to 1.

To increase data diversity, we implement group beam search by setting the num_beam_group parameter to 5. This technique, combined with token-level search, significantly enhances the diversity of the sampled data. Furthermore, we adjust the diversity_penalty parameter to 3.0, which regulates both diversity and quality among the different beam groups.
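The sentence-by-sentence control described above can be sketched as a simple loop. This is only an illustration under stated assumptions: `sentence_level_decode`, `propose`, and `score` are hypothetical names, `propose` stands in for a beam search that stops at a period, and `score` stands in for whatever per-sentence score guides selection (in DSGD, the debiased self-judgment score defined in the method section, whose exact form and use of $\alpha$ are not reproduced here).

```python
from typing import Callable, List


def sentence_level_decode(
    propose: Callable[[str], List[str]],
    score: Callable[[str, str], float],
    max_sentences: int = 10,
) -> str:
    """Build a response one sentence at a time.

    At each step, `propose(prefix)` returns candidate next sentences
    (each ending in a period) and the highest-scoring one is appended.
    """
    text = ""
    for _ in range(max_sentences):
        candidates = propose(text)
        if not candidates:  # nothing left to generate
            break
        best = max(candidates, key=lambda s: score(text, s))
        text = (text + " " + best).strip()
    return text


# Toy stand-ins for the model: fixed candidates and made-up scores.
steps = [
    ["A dog sits on a bench.", "A dog and a cat sit on a bench."],
    ["The bench is in a park.", "The bench is next to a car."],
]
toy_scores = {
    "A dog sits on a bench.": 0.9,
    "A dog and a cat sit on a bench.": 0.4,
    "The bench is in a park.": 0.8,
    "The bench is next to a car.": 0.3,
}


def propose(prefix: str) -> List[str]:
    done = prefix.count(".")  # number of sentences already chosen
    return steps[done] if done < len(steps) else []


def score(prefix: str, sentence: str) -> float:
    return toy_scores[sentence]


print(sentence_level_decode(propose, score))
```

In a real pipeline, the candidate list would come from the group beam search configured above (five beams, five beam groups, a diversity penalty of 3.0) rather than from a fixed table.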

A.1.2 Ensuring Safety via FGSD

In FGSD, $\alpha$ is set to 0.1. As described in Equation 4, we sample 1,000 questions from the models’ training datasets and calculate the unsafe score for LLaVA-1.5, InstructBLIP, and mPLUG-Owl2, setting the thresholds at 23, 22.4, and 14.9, respectively. The statistical results are shown in Figures 6, 7, and 8. To calculate MCR, we sample data from MSCOCO (Lin et al., 2014), ShareGPT-4V (Chen et al., 2023), MovieNet (Huang et al., 2020), Google Landmark v2 (Weyand et al., 2020), VQA v2 (Goyal et al., 2017), OKVQA (Marino et al., 2019), and TextVQA (Singh et al., 2019).

A.1.3 Improving Overall Capability with DSR

The hyperparameters for generating the data are the same as those for DSGD. The training hyperparameters are listed in Table 6. The model was trained for 1 epoch, which took 6 hours on a single A100 80GB GPU.

A.2 Overview of Baselines

We evaluate our approach against several established decoding methods, including greedy decoding, nucleus sampling, Beam Search, DoLa (Chuang et al., 2023), visual contrastive decoding (VCD) (Leng et al., 2023), HALC (Chen et al., 2024b), LURE (Zhou et al., 2023), Woodpecker (Yin et al., 2023), and OPERA (Huang et al., 2023). Greedy decoding deterministically selects the highest-probability token at each step, while Beam Search extends this by exploring multiple high-probability sequences simultaneously. Nucleus sampling focuses on sampling from the top portion of the probability distribution. DoLa contrasts logits from different layers to mitigate hallucinations in LLMs. OPERA combats hallucinations by introducing an over-trust penalty and using a retrospection-allocation mechanism to reduce dependence on limited summary tokens. VCD, specifically designed for vision-language models, reduces object hallucinations by contrasting outputs from original and modified images. HALC is a decoding strategy that reduces object hallucinations by using an adaptive focal-contrast grounding mechanism to correct hallucinating tokens and a matching-based beam search to balance hallucination mitigation with text generation quality. LURE and Woodpecker respectively use MiniGPT-4 and GPT-3.5 to modify the hallucination-containing outputs of the models.

A.3 Evaluation Metrics and Benchmarks

In our experiments, we use tasks such as visual question answering Fu et al. (2024); Yang et al. (2025b); Zhao and Zhang (2024); Cao and Zhao (2025) and image captioning.

• MME Fu et al. (2024) offers a robust benchmark for evaluating LVLMs across multimodal tasks. It assesses models on two major fronts: perception and cognition, using 14 well-structured subtasks that challenge their interpretive and analytical abilities.

• SEED-Bench Li et al. (2023b) focuses on measuring the generative comprehension of LVLMs. It includes a large dataset of 19K multiple-choice questions, complete with human annotations, spanning 12 different evaluation dimensions to test both spatial and temporal reasoning across images and videos.

• LLaVAW Liu et al. (2023c) provides a targeted evaluation for visual reasoning models. It features 24 diverse images paired with 60 questions, covering a variety of scenarios, including indoor, outdoor, and abstract settings.

• MMBench Liu et al. (2024d) takes a two-pronged approach by introducing an extensive dataset that broadens the scope of evaluation questions and a novel CircularEval strategy that utilizes ChatGPT to convert free-form responses into structured answer choices.

• MM-Vet Yu et al. (2023b) is designed to assess LVLMs through a wide range of multimodal tasks, structured into 16 distinct integrations based on 6 core vision-language capabilities, providing a detailed performance analysis across different question types and answer formats.

• ScienceQA Lu et al. (2022) focuses on evaluating multi-hop reasoning and interpretability within scientific domains. It features a large dataset of approximately 21K multiple-choice questions across a variety of science topics, accompanied by detailed annotations and explanations.

• VizWiz Gurari et al. (2018) stands out in the VQA field by using a dataset of over 31,000 visual questions that come from a real-world setting, featuring images taken by visually impaired individuals and their associated spoken queries, along with crowdsourced answers.

• GQA Hudson and Manning (2019) is built for complex visual reasoning tasks, containing 22 million questions generated from scene graph-based structures. It incorporates innovative evaluation metrics focused on consistency, grounding, and plausibility, pushing the boundaries of vision-language evaluation.

• POPE Li et al. (2023d) introduces a methodology to evaluate object hallucination in LVLMs, transforming the task into a binary classification problem. By using simple Yes-or-No prompts, POPE highlights model tendencies towards hallucination through various object sampling strategies.

• CHAIR Rohrbach et al. (2019) is a widely used metric for assessing object hallucination in image captioning. It includes two variants: CHAIR ${}_{\text{I}}$ , which evaluates object hallucination at the instance level, and CHAIR ${}_{\text{S}}$ , which does so at the sentence level. Both are defined as:

$$\text{CHAIR}_{\text{I}} = \frac{|\{\text{hallucinated objects}\}|}{|\{\text{all mentioned objects}\}|}, \qquad \text{CHAIR}_{\text{S}} = \frac{|\{\text{captions with hallucinated objects}\}|}{|\{\text{all captions}\}|}$$

For our evaluation, we randomly sampled 500 images from the COCO Lin et al. (2014) validation set and applied the CHAIR metric to measure hallucinations.

• MM-SafetyBench Liu et al. (2024c) is a comprehensive safety evaluation framework for Multimodal Large Language Models (MLLMs). The benchmark targets models’ vulnerabilities to visual prompt attacks, particularly those triggered by harmful query-relevant images. It consists of 13 different scenarios (e.g., illegal activity, hate speech, physical harm), represented by 5,040 text-image pairs, to assess how well MLLMs can avoid producing unsafe responses. Experimental results show that many MLLMs, including state-of-the-art models like LLaVA-1.5, are highly susceptible to attacks, especially when prompted with query-relevant images. MM-SafetyBench helps quantify these risks and provides insights into improving the safety protocols of MLLMs.

• FaithScore Jing et al. (2024) is a reference-free, fine-grained evaluation metric designed to measure the faithfulness of free-form answers generated by large vision-language models (LVLMs). FaithScore evaluates the consistency between descriptive sub-sentences in the generated answers and the input images. The process involves three steps: (1) identifying descriptive sub-sentences, (2) extracting atomic facts from these sub-sentences, and (3) verifying these facts against the input image. FaithScore has shown a strong correlation with human judgments on faithfulness, providing a more interpretable and fine-grained evaluation compared to existing metrics.
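As an illustration of the CHAIR entry in the list above, the two variants can be sketched in a few lines. CHAIR ${}_{\text{I}}$ is commonly computed as the fraction of mentioned objects that do not appear in the image, and CHAIR ${}_{\text{S}}$ as the fraction of captions containing at least one such object. The sketch assumes that objects have already been extracted from each caption and mapped to a shared vocabulary (the original metric relies on MSCOCO categories and synonym lists for this step); the function name `chair_scores` and the toy data are hypothetical.

```python
def chair_scores(mentioned_per_caption, present_per_image):
    """Return (CHAIR_I, CHAIR_S) for paired captions and images.

    mentioned_per_caption: list of sets of objects named in each caption.
    present_per_image: list of sets of objects actually in each image.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    hallucinated_captions = 0
    for mentioned, present in zip(mentioned_per_caption, present_per_image):
        hallucinated = mentioned - present  # named but not in the image
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        if hallucinated:
            hallucinated_captions += 1
    n_captions = len(mentioned_per_caption)
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = hallucinated_captions / n_captions if n_captions else 0.0
    return chair_i, chair_s


# Toy data: only the second caption names an object ("car") not in its image.
mentioned = [{"dog", "bench"}, {"cat", "car", "tree"}, {"person"}]
present = [{"dog", "bench", "tree"}, {"cat", "tree"}, {"person", "bike"}]
chair_i, chair_s = chair_scores(mentioned, present)
print(f"CHAIR_I = {chair_i:.3f}, CHAIR_S = {chair_s:.3f}")
```

Here one of six mentioned objects is hallucinated (CHAIR ${}_{\text{I}}$ of 1/6) and one of three captions contains a hallucination (CHAIR ${}_{\text{S}}$ of 1/3); lower is better for both.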

Appendix B Efficiency Analysis

Large models face efficiency challenges Yang et al. (2025a); Hu et al. (2025); Li et al. (2025). We present a comparison of time efficiency between DSGD and other approaches in Table 7.

Appendix C More Results

C.1 Quantitative Analysis of Self-Judgment Score

We extend our investigation beyond the LLaVA-1.5-7B model by including results for InstructBLIP and mPLUG-Owl2, as detailed in Table 8. Here, Spearman’s rank correlation coefficient measures how strongly two variables increase or decrease together. It ranges from $-1$ to $1$ : positive values indicate a positive correlation, negative values indicate a negative correlation, and values of larger magnitude indicate a stronger monotonic relationship. These additional analyses further confirm the existence of bias toward the textual modality in the self-judgment of LVLMs.
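For readers unfamiliar with the statistic, the following is a minimal standard-library sketch of Spearman’s rank correlation: each variable is converted to ranks (ties receive their average rank) and the Pearson correlation of the ranks is returned. The function names and the toy inputs are illustrative assumptions and do not correspond to the data behind Table 8.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Toy inputs: y rises whenever x rises, so the ranks agree exactly.
x = [0.10, 0.40, 0.35, 0.80]
y = [1, 3, 2, 4]
print(spearman(x, y))  # 1.0
```

Because only ranks are used, the coefficient captures monotonic relationships and is insensitive to the scale of either variable.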

C.2 Scalability Study of DSR with Larger-Scale Preference Data

We further investigate the scalability of DSR by increasing the amount of preference data used for training. Specifically, we compare the performance of DSR when trained with 6K and 10K preference data, alongside the original LLaVA-1.5-7B baseline. As shown in Table 9, increasing the training data from 6K to 10K leads to consistent improvements across most benchmarks. Notably, DSR achieves the best or tied-best results on all metrics when scaled to 10K data, demonstrating its strong scalability and effectiveness. These findings indicate that DSR can effectively leverage larger-scale preference data to further enhance the overall capability of LVLMs.

C.3 VILA Experiments with DSR

To evaluate the generalizability of DSR, we applied it to the advanced VILA (Lin et al., 2024) model across various benchmarks. Table 10 presents the experimental results of VILA combined with different preference data curation methods: the baseline VILA, VILA+CSR, and VILA+DSR.

C.4 Settings of Hyper-parameters

Further ablation studies on the effects of hyper-parameters are presented in Figures 9, 10, 11 and Table 11. Figure 9 illustrates the effect of number of beams in DSGD. Figure 10 illustrates the effect of diversity_penalty in DSGD. Figure 11 illustrates the effect of $\alpha$ in DSGD. Table 11 illustrates the effect of $\alpha$ in FGSD.

C.5 Ablation Studies

The ablation study results for FGSD and DSR can be found in Table 12 and Table 13.

C.6 Case Studies

Figure 12 presents a case where our approach enhances faithfulness. Figure 13 illustrates how our method safely prevents an attack, while Figure 14 demonstrates that the model cannot assess the safety of the response without image input.

Appendix D Prompt Design

The detailed prompt designs for each task are shown in Tables 14, 15, 16, 17, and 18.

Bibliography


Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
Bai et al. (2023) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxu
Cao and Zhao (2025) Linbo Cao and Jinman Zhao. 2025. Pretraining on the test set is no longer all you need: A debate-driven approach to qa benchmarks. Preprint, arXiv:2507.17747.
Chan et al. (2023) Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2023. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201.
Chang et al. (2024) Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. 2024. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3):1–45.
Chen et al. (2023) Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. 2023. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793.
Chen et al. (2024a) Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, and Ajay Divakaran. 2024a. Dress: Instructing large vision-language models to align and interact with humans via natural language feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14239–14250.
Chen et al. (2024b) Zhaorun Chen, Zhuokai Zhao, Hongyin Luo, Huaxiu Yao, Bo Li, and Jiawei Zhou. 2024b. Halc: Object hallucination reduction via adaptive focal-contrast decoding. arXiv preprint arXiv:2403.00425.