Improving Alignment in LVLMs with Debiased Self-Judgment
Sihan Yang, Chenhang Cui, Zihao Zhao, Yiyang Zhou, Weilong Yan, Ying Wei, Huaxiu Yao

TL;DR
This paper introduces a novel internal self-evaluation method for LVLMs that reduces hallucinations and improves alignment without external data, enhancing safety and performance.
Contribution
The paper proposes a debiased self-judgment score that enables LVLMs to self-evaluate and improve alignment autonomously, reducing reliance on external datasets.
Findings
Significantly reduces hallucinations in LVLM outputs.
Improves safety and alignment quality.
Outperforms traditional alignment methods.
Abstract
The rapid advancements in Large Language Models (LLMs) and Large Visual-Language Models (LVLMs) have opened up new opportunities for integrating visual and linguistic modalities. However, effectively aligning these modalities remains challenging, often leading to hallucinations--where generated outputs are not grounded in the visual input--and raising safety concerns across various domains. Existing alignment methods, such as instruction tuning and preference tuning, often rely on external datasets, human annotations, or complex post-processing, which limit scalability and increase costs. To address these challenges, we propose a novel approach that generates the debiased self-judgment score, a self-evaluation metric created internally by the model without relying on external resources. This enables the model to autonomously improve alignment. Our method enhances both decoding…
| | LLaVA-1.5 | | | InstructBLIP | | | mPLUG-Owl2 | | |
|---|---|---|---|---|---|---|---|---|---|
| Method | CHAIRS | CHAIRI | BLEU | CHAIRS | CHAIRI | BLEU | CHAIRS | CHAIRI | BLEU |
| Greedy | 22.4 | 5.8 | 0.249 | 29.0 | 12.9 | 0.217 | 23.1 | 8.4 | 0.279 |
| Beam Search | 19.6 | 6.3 | 0.247 | 31.8 | 14.3 | 0.228 | 22.5 | 8.1 | 0.280 |
| DoLA | 21.0 | 6.7 | 0.256 | 30.0 | 9.1 | 0.238 | 22.0 | 7.8 | 0.283 |
| OPERA | 26.4 | 7.8 | 0.210 | 26.0 | 8.2 | 0.251 | 18.6 | 6.6 | 0.286 |
| VCD | 20.7 | 5.3 | 0.247 | 25.8 | 7.1 | 0.244 | 25.5 | 9.2 | 0.273 |
| Woodpecker | 17.5 | 4.0 | 0.259 | 28.0 | 11.0 | 0.249 | 20.0 | 7.3 | 0.286 |
| LURE | 18.0 | 4.5 | 0.253 | 31.0 | 11.9 | 0.251 | 16.4 | 6.4 | 0.283 |
| HALC | 15.9 | 3.5 | 0.255 | 27.2 | 10.3 | 0.253 | 21.1 | 7.4 | 0.298 |
| DSGD | 15.2 | 4.0 | 0.263 | 20.1 | 6.9 | 0.271 | 14.2 | 4.5 | 0.300 |
| Method | F-Score ↑ | F-ScoreS ↑ |
|---|---|---|
| Greedy | 84.6 | 66.3 |
| VCD | 85.2 | 63.1 |
| Opera | 88.4 | 67.9 |
| HALC | 86.3 | 67.8 |
| LURE | 88.8 | 67.4 |
| Woodpecker | 86.2 | 66.5 |
| DSGD | 89.3 | 75.1 |
| Model | Method | MCR ↓ | IA ↓ | HS ↓ | MG ↓ | Fr ↓ | Po ↓ | PV ↓ | Avg ↓ |
|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5 | Vanilla | - | 89.7 | 65.0 | 63.6 | 74.0 | 78.0 | 68.3 | 73.1 |
| LLaVA-1.5 | ECSO | 0 | 37.1 | 20.2 | 20.5 | 31.2 | 63.3 | 35.3 | 34.6 |
| LLaVA-1.5 | FGSD (Ours) | 0 | 15.3 | 26.2 | 17.9 | 15.6 | 21.8 | 18.9 | 19.3 |
| InstructBLIP | Vanilla | - | 69.1 | 44.1 | 45.5 | 43.5 | 43.1 | 49.6 | 49.2 |
| InstructBLIP | ECSO | 14.6 | - | - | - | - | - | - | - |
| InstructBLIP | FGSD (Ours) | 0 | 17.8 | 18.6 | 20.3 | 24.5 | 40.1 | 33.5 | 25.8 |
| mPLUG-Owl2 | Vanilla | - | 94.8 | 81.6 | 81.8 | 85.7 | 75.2 | 88.5 | 84.6 |
| mPLUG-Owl2 | ECSO | 0 | 22.7 | 28.2 | 38.6 | 24.0 | 69.7 | 86.3 | 44.9 |
| mPLUG-Owl2 | FGSD (Ours) | 0 | 13.7 | 19.1 | 33.0 | 12.4 | 38.5 | 31.2 | 24.7 |
| | Comprehensive Benchmark | | | | | General VQA | | | Hallucination Benchmark | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Method | MME ↑ | SEED ↑ | LLaVAW ↑ | MMB ↑ | MM-Vet ↑ | SQAI ↑ | VisWiz ↑ | GQA ↑ | POPE ↑ | CHAIRS ↓ | CHAIRI ↓ |
| LLaVA-1.5 7B | 1858.9 | 58.6 | 63.4 | 64.3 | 30.5 | 66.8 | 50.0 | 62.0 | 85.9 | 48.8 | 14.9 |
| + Silkie | 1754.5 | 59.3 | 62.1 | 64.0 | 31.2 | 66.2 | 52.6 | 63.2 | 83.7 | 40.3 | 13.2 |
| + LLaVA-RLHF | 1825.6 | 58.1 | 63.7 | 63.4 | 31.1 | 65.8 | 51.7 | 61.3 | 81.5 | 38.7 | 11.3 |
| + POVID | 1778.1 | 60.2 | 65.8 | 64.9 | 31.8 | 68.8 | 53.6 | 61.7 | 86.9 | 35.2 | 8.3 |
| + RLHF-V | 1838.6 | 60.1 | 65.4 | 63.6 | 30.9 | 67.1 | 54.2 | 62.1 | 86.2 | 29.7 | 7.5 |
| + RLAIF-V | - | - | - | - | - | - | - | - | - | 21.2 | 4.7 |
| + CSR | 1851.5 | 60.6 | 66.0 | 64.3 | 32.1 | 68.5 | 53.1 | 61.8 | 86.9 | 30.6 | 8.2 |
| + DSR (Ours) | 1879.8 | 60.8 | 66.3 | 64.5 | 32.1 | 69.2 | 54.2 | 62.1 | 87.1 | 27.1 | 6.9 |
| Methods | CHAIRS ↓ | CHAIRI ↓ |
|---|---|---|
| w/o Self-Judgment | 24.4 | 8.0 |
| w/o Debiasing | 19.0 | 6.2 |
| DSGD | 15.2 | 5.0 |
| Hyperparameter | Value |
|---|---|
| lora_r | 128 |
| lora_alpha | 256 |
| lora_target | all |
| mm_projector_lr | 2e-5 |
| Batch size | 1 |
| Learning rate | 1e-7 |
| model_max_length | 1024 |
| Method | Require finetuning | Require external tool | Only work for image captioning | Execution time (s) |
|---|---|---|---|---|
| Greedy | × | × | × | 1.1 |
| Beam Search | × | × | × | 2.0 |
| DoLA | × | × | × | 10.5 |
| VCD | × | × | ✓ | 9.9 |
| Opera | × | × | ✓ | 12.5 |
| POVID | ✓ | × | × | 1.2 |
| LURE | ✓ | × | ✓ | 3.9 |
| WoodPecker | × | ✓ | × | N/A |
| DSGD(Ours) | × | × | × | 3.5 |
| Model | Self-Judgment vs. FaithScore | Self-Judgment vs. Blind Self-Judgment |
|---|---|---|
| LLaVA-1.5-7B | 0.673 | 0.273 |
| InstructBLIP | 0.629 | 0.371 |
| mPLUG-Owl2 | 0.750 | 0.296 |
| Method | MMEP ↑ | MMEC ↑ | SEED ↑ | LLaVAW ↑ | MMB ↑ | MM-Vet ↑ | SQAI ↑ | VisWiz ↑ | GQA ↑ | POPE ↑ | CHAIRS ↓ | CHAIRI ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5-7B | 1510.7 | 348.2 | 58.6 | 63.4 | 64.3 | 30.5 | 66.8 | 50.0 | 62.0 | 85.9 | 48.8 | 14.9 |
| 6K Data | 1500.6 | 379.2 | 60.8 | 66.3 | 64.5 | 32.1 | 69.2 | 54.2 | 62.1 | 87.1 | 27.1 | 6.9 |
| 10K Data | 1508.2 | 380.5 | 61.3 | 66.7 | 64.7 | 32.7 | 69.5 | 55.0 | 62.1 | 87.1 | 25.8 | 6.1 |
| Method | MME ↑ | SEED ↑ | LLaVAW ↑ | MMB ↑ | MM-Vet ↑ | SQAI ↑ | VisWiz ↑ | GQA ↑ | POPE ↑ | CHAIRS ↓ | CHAIRI ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| VILA | 1849.4 | 61.1 | 69.7 | 68.9 | 34.9 | 68.2 | 57.8 | 62.3 | 85.50 | 31.0 | 8.8 |
| + CSR | 1852.5 | 63.2 | 73.5 | 69.3 | 38.3 | 71.9 | 62.3 | 62.2 | 86.82 | 29.2 | 7.9 |
| + DSR | 1875.5 | 63.2 | 73.9 | 69.7 | 38.4 | 72.4 | 61.0 | 62.5 | 86.96 | 28.5 | 7.4 |
| Model | Method | MCR ↓ | IA ↓ | HS ↓ | MG ↓ | Fr ↓ | Po ↓ | PV ↓ | Avg ↓ |
|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5 | w/o Defense | - | 89.7 | 65.0 | 63.6 | 74.0 | 78.0 | 68.3 | 73.1 |
| LLaVA-1.5 | | 0 | 16.5 | 27.5 | 18.0 | 18.8 | 22.3 | 20.5 | 20.6 |
| LLaVA-1.5 | | 0 | 11.3 | 21.4 | 13.3 | 11.0 | 17.4 | 14.3 | 14.8 |
| Model | Method | MCR ↓ | IA ↓ | HS ↓ | MG ↓ | Fr ↓ | Po ↓ | PV ↓ | Avg ↓ |
|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5 | w/o Defense | - | 89.7 | 65.0 | 63.6 | 74.0 | 78.0 | 68.3 | 73.1 |
| LLaVA-1.5 | w/o Debiasing | 0 | 13.4 | 21.9 | 15.1 | 12.0 | 18.9 | 17.5 | 16.5 |
| LLaVA-1.5 | FGSD (Ours) | 0 | 11.3 | 21.4 | 13.3 | 11.0 | 17.4 | 14.3 | 14.8 |
| | Comprehensive Benchmark | | | | | | General VQA | | | Hallucination Benchmark | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Method | MMEP | MMEC | SEED | LLaVAW | MMB | MM-Vet | SQAI | VisWiz | GQA | POPE | CHAIRS | CHAIRI |
| LLaVA-1.5-7B | 1510.7 | 348.2 | 58.6 | 63.4 | 64.3 | 30.5 | 66.8 | 50.0 | 62.0 | 85.9 | 48.8 | 14.9 |
| w/o Debiasing | 1495.3 | 370.0 | 60.6 | 65.8 | 64.3 | 32.0 | 69.3 | 54.0 | 61.7 | 86.7 | 30.2 | 9.1 |
| w/o ILJ | 1494.4 | 369.7 | 60.7 | 66.0 | 64.5 | 32.0 | 68.8 | 54.1 | 62.0 | 86.7 | 28.9 | 7.7 |
| + DSR (Ours) | 1500.6 | 379.2 | 60.8 | 66.3 | 64.5 | 32.1 | 69.2 | 54.2 | 62.1 | 87.1 | 27.1 | 6.9 |
| Please analyze the provided image and determine if the following description is accurate. |
| The description is:{description} |
| “yes” or “no” or “maybe”. |
| Answer: {text} |
| Does the Answer contains harmful and unethical suggestions? |
| “yes” or “no” or “maybe”. |
| Please analyze the provided image and determine if the answer to the question is correct. |
| Question: {qs} |
| Answer: {ans} |
| “yes” or “no” or “maybe”. |
| Please analyze the provided image and determine if the answer to the question is correct. |
| The description is: {description} |
| “yes” or “no” or “maybe”. |
| Please analyze the provided image and determine if the answer to the question is correct. |
| Question: {qs} |
| Answer: {ans} |
| “yes” or “no” or “maybe”. |
Improving Alignment in LVLMs with Debiased Self-Judgment
Sihan Yang1∗ Chenhang Cui2∗ Zihao Zhao1 Yiyang Zhou3 Weilong Yan2
Ying Wei1 Huaxiu Yao3
1 Nanyang Technological University 2 National University of Singapore 3 UNC-Chapel Hill
Abstract
The rapid advancements in Large Language Models (LLMs) and Large Visual-Language Models (LVLMs) have opened up new opportunities for integrating visual and linguistic modalities. However, effectively aligning these modalities remains challenging, often leading to hallucinations (where generated outputs are not grounded in the visual input) and raising safety concerns across various domains. Existing alignment methods, such as instruction tuning and preference tuning, often rely on external datasets, human annotations, or complex post-processing, which limit scalability and increase costs. To address these challenges, we propose a novel approach that generates the debiased self-judgment score, a self-evaluation metric created internally by the model without relying on external resources. This enables the model to autonomously improve alignment. Our method enhances both decoding strategies and preference tuning processes, resulting in reduced hallucinations, enhanced safety, and improved overall capability. Empirical results show that our approach significantly outperforms traditional methods, offering a more effective solution for aligning LVLMs. Code is at [https://github.com/sihany077/LVLM_Debiased_Self_Judge](https://github.com/sihany077/LVLM_Debiased_Self_Judge).
1 Introduction
Owing to the powerful capabilities of Large Language Models (LLMs) (Bai et al., 2023; Touvron et al., 2023; Chiang et al., 2023), Large Visual-Language Models (LVLMs) demonstrate impressive performance by effectively integrating visual inputs into the latent representation space of LLMs (Liu et al., 2023c; Ye et al., 2023a; Zhu et al., 2023). However, similar to LLMs, LVLMs face inherent alignment challenges, including hallucinations (where the generated content is not grounded in the image) (Li et al., 2023d; Liu et al., 2023a), and safety issues (Liu et al., 2024a; Pi et al., 2024), which negatively impact the application of LVLMs across various domains (Li et al., 2024; Liu et al., 2024b; Zhang et al., 2024).
To address misalignment in LVLMs, a growing body of recent research has explored enhancing model alignment by leveraging external tools or human annotations to assist with preference tuning (Yu et al., 2024b; Wang et al., 2024; Yu et al., 2024a) and inference (Yin et al., 2023; Lee et al., 2024). However, most prevailing approaches rely heavily on powerful external resources—such as advanced models like GPT (Achiam et al., 2023) or human experts—which can lead to substantial costs during both training and inference. Moreover, in a hypothetical future where an AI system requiring alignment surpasses both human intelligence and the capabilities of other models, supervision from humans or existing models may offer only limited effectiveness for such a superintelligent system.
In response to these challenges, we draw inspiration from the effective self-reflection abilities observed in LLMs (Kadavath et al., 2022) and explore how LVLMs can self-evaluate and enhance their alignment independently. We observe that the internal confidence of LVLMs can reflect the faithfulness of their output sentences, but it also incorporates significant textual priors. Building on this, we introduce the debiased self-judgment score, a sentence-level evaluation metric generated autonomously by the model without relying on external resources. This score is applied to both decoding and preference tuning. Our results show that this approach significantly enhances LVLMs’ performance, improving faithfulness, safety, and overall capability, as shown in Figure 1. In summary, our contributions are three-fold:
- We demonstrate that leveraging an LVLM’s intrinsic confidence as a self-judgment score is effective, but that the score is influenced by strong textual priors. To address this, we propose a debiasing method for the self-judgment score.
- The debiased self-judgment score is used to guide decoding, resulting in more faithful and safer outputs. It is also applied to self-improvement training, improving model performance across multiple dimensions.
- Experiments on hallucination, safety, and comprehensive benchmarks across different LVLMs validate our method’s effectiveness.
2 Related Work
2.1 Alignment in LVLMs
LVLMs demonstrate exceptional performance across a range of tasks (Liu et al., 2024b; Li et al., 2024; Zhang et al., 2024). However, they remain vulnerable to misalignment issues, which can lead to significant challenges such as safety concerns and hallucinations. To mitigate hallucinations, several methods have been proposed, including instruction tuning (Liu et al., 2023a), decoding strategies (Leng et al., 2024; Huang et al., 2024; Park et al., 2024; Chen et al., 2024b), preference fine-tuning (Sun et al., 2023; Yu et al., 2023a), and improved vision encoders (Jain et al., 2024). To tackle safety challenges, researchers have employed strategies such as fine-tuning for safety (Chen et al., 2024a; Pi et al., 2024), adopting robust architectures (Hossain and Imteaj, 2024), and evaluating responses with the assistance of other models (Ding et al., 2024). Despite these advancements, most existing methods rely on external models or tools, limiting scalability and introducing potential biases. In contrast, our approach leverages internal model capabilities to generate more faithful, safe responses and improve overall LVLM performance, without external resources.
2.2 Judgment in LLMs and LVLMs
The LLM-as-a-Judge (Zheng et al., 2023) paradigm has become a widely adopted method for evaluating the quality of outputs from large language models (Wang et al., 2023; Yuan et al., 2024; Chan et al., 2023). This approach typically involves using one language model to assess the outputs of another (Kim et al., 2023; Chan et al., 2023; Chang et al., 2024), providing a scalable alternative to traditional human evaluation. Beyond language models, LVLM judges have also been widely applied for various purposes, such as evaluating LVLM performance (Xiong et al., 2024; Jing et al., 2023), correcting unfaithful outputs during inference (Lee et al., 2024), and generating preference data to improve the overall performance of LVLMs (Wang et al., 2024). However, these methods often rely on powerful models (e.g., RLAIF-V (Yu et al., 2024b)), additional training of the judge model (e.g., Volcano (Lee et al., 2024), LLaVA-Critic (Xiong et al., 2024)), or human annotations (e.g., SIMA (Wang et al., 2024)), which limit scalability and introduce additional costs. In contrast, our proposed approach harnesses the models’ intrinsic confidence to accurately assess LVLMs’ outputs. This shows the potential of LVLMs’ self-judgment for inference and preference data generation, without external models or human annotation.
3 Preliminary Observations
In this section, we present preliminary findings on the potential and limitations of LVLMs’ self-judgment abilities, which serve as the foundation for our proposed debiased self-judgment score.
3.1 Potential of LVLMs for Self-Judgment
Previous research (Kadavath et al., 2022; Phute et al., 2023) shows that LLMs can sometimes evaluate the accuracy of their own responses, offering a scalable way to assess model outputs. Inspired by this, we explore whether LVLMs can self-evaluate to improve alignment and output quality. Specifically, we focus on faithfulness—the correspondence between image descriptions and visual content—as it is a key aspect of alignment in LVLMs. We use LLaVA-1.5 7B (Liu et al., 2023b) to generate one description for each of 500 randomly selected images from the MSCOCO dataset (Lin et al., 2014). To objectively measure the faithfulness of these descriptions, we calculate the FaithScore (Jing et al., 2023), defined as the proportion of correct atomic facts to total atomic facts in a description (a score closer to 1 indicates higher faithfulness). To enable the LVLM to self-assess description faithfulness, we use the prompt “Is the description accurate?” and extract the logit for the “Yes” response as the self-judgment score. The correlation between self-judgment scores and FaithScores is illustrated in Figure 2 (Top).
The figure shows a positive correlation between self-judgment scores and FaithScores, indicating higher confidence often corresponds to more accurate descriptions. However, the moderate correlation suggests that self-judgment alone may not fully capture faithfulness, requiring further refinement.
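The comparison described in this section can be illustrated with a small sketch. It assumes that FaithScore reduces to a ratio of counts, as defined above, and it uses the Pearson coefficient as the correlation measure. The choice of Pearson, the function names `faith_score` and `pearson`, and the toy numbers are assumptions for illustration, not details taken from the paper.

```python
from math import sqrt


def faith_score(num_correct_facts: int, num_total_facts: int) -> float:
    """Proportion of correct atomic facts among all atomic facts."""
    if num_total_facts <= 0:
        raise ValueError("a description needs at least one atomic fact")
    return num_correct_facts / num_total_facts


def pearson(xs, ys) -> float:
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Toy data (made up): one self-judgment logit and one FaithScore per description.
self_judgment = [1.0, 2.0, 3.0, 4.0]
faith = [faith_score(2, 4), faith_score(3, 5), faith_score(4, 5), faith_score(9, 10)]
correlation = pearson(self_judgment, faith)
print(round(correlation, 3))
```

A coefficient near 1 would mean the model's confidence tracks faithfulness closely; a moderate value matches the observation that self-judgment alone is informative but incomplete.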
3.2 LVLMs’ Limitations in Self-Judgment
LVLMs build on the advanced text-generation capabilities of LLMs to create multimodal frameworks, yet they inherit unimodal biases from these language models. For example, prior research (Leng et al., 2024; Han et al., 2022; Li et al., 2023d) indicates that LVLMs tend to overlook image content and overly rely on text-based priors when generating descriptions.
We further investigate whether these unimodal biases affect the LVLMs’ ability to assess the faithfulness of their outputs. Specifically, we reuse the 500 image descriptions and their corresponding self-judgment scores obtained in Section 3.1. To isolate the model’s text-based priors, we remove the images and have the same LVLM evaluate the faithfulness of the sentences using the self-judgment method described in Section 3.1. This generates scores (referred to as blind self-judgment scores) that represent the model’s text-based priors.
As shown in Figure 2 (Bottom), the moderate positive correlation between the LVLM’s self-judgment scores and the blind self-judgment scores suggests that the model’s self-judgment is biased toward the textual modality, rather than reflecting true multimodal faithfulness. Quantitative analyses on more models are provided in Appendix C.1.
4 Method
In this section, we propose a method that leverages the model’s internal confidence for self-judgment and eliminates text modality bias, resulting in a debiased self-judgment score. This score is used for decoding and preference tuning to enhance LVLMs’ faithfulness, safety, and overall capability. Specifically, Section 4.1 describes how to derive the debiased self-judgment score and apply it to generate more faithful descriptions; Section 4.2 incorporates the score with a safety prefix to prevent unsafe outputs; and Section 4.3 investigates how both sentence-level and instance-level self-judgment contribute to self-improvement training.
4.1 Deriving the Debiased Self-Judgment Score and Its Application in Decoding for Faithfulness
In this section, using faithfulness evaluation as an example, we introduce a method that leverages the model’s internal confidence to perform self-judgment and mitigate text modality bias, resulting in the debiased self-judgment score. This score is then applied in the decoding process through Debiased Self-Guided Decoding (DSGD) to prioritize visually grounded content and enhance faithfulness. The process is divided into three main components (shown in Figure 3, top):
Self-Judgment Scoring. By leveraging the intrinsic confidence of LVLMs, we have the model judge its own outputs at the sentence level for factual accuracy. For a sentence $s$ generated by the LVLM, we use a prompt $p$, such as “Is the description accurate?”, to guide the LVLM in evaluating the faithfulness of $s$ based on the image $v$. We compute the initial faithfulness score, $S_{\text{self}}(s)$, as the sum of the logits for the tokens “Yes” and “yes” from the LVLM’s next-token predictions:

$$S_{\text{self}}(s) = \sum_{c \in \text{cls}} \text{logit}_\theta(c \mid v, p, s),$$

where $\text{logit}_\theta(c \mid \cdot)$ is the LVLM’s next-token logit for token $c$ and cls represents the tokens “Yes” and “yes”.
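The scoring step can be sketched as a lookup and sum over next-token logits. This is a minimal illustration that assumes the logits are available as a plain list indexed by token id; the names `self_judgment_score`, `toy_token_ids`, and `toy_logits` are hypothetical, and a real LVLM would supply the vocabulary and the logits obtained after feeding the image, the judging prompt, and the sentence.

```python
def self_judgment_score(next_token_logits, token_ids, cls=("Yes", "yes")):
    """Sum the next-token logits of the affirmative tokens in `cls`."""
    return sum(next_token_logits[token_ids[token]] for token in cls)


# Toy vocabulary and logits (made up for illustration).
toy_token_ids = {"Yes": 0, "yes": 1, "No": 2, "no": 3}
toy_logits = [2.0, 1.5, -0.5, -1.0]

score = self_judgment_score(toy_logits, toy_token_ids)
print(score)
```

Summing over both capitalizations avoids depending on which surface form the tokenizer happens to favor for the affirmative answer.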
Score Debiasing. Notably, as our observations in Section 3.2 reveal, LVLMs inherit a bias toward text from LLMs, which can lead to inaccurate judgment of their own generated sentences in certain cases. To mitigate this text bias in the initial faithfulness score, we introduce a score debiasing process, as illustrated in Figure 4. Specifically, we first feed the judging prompt $p$ and the sentence $s$ being judged to the LVLM without an image, obtaining logits that contain only text priors. Then, using the same method as in Self-Judgment Scoring, we compute the blind score $S_{\text{blind}}(s)$ as follows:

$$S_{\text{blind}}(s) = \sum_{c \in \text{cls}} \text{logit}_\theta(c \mid p, s),$$

where $\text{logit}_\theta(c \mid \cdot)$ is the LVLM’s next-token logit for token $c$ and cls represents “Yes” and “yes”. Finally, to reduce the influence of text modality bias, we employ a contrastive objective to obtain the final faithfulness score:
[TABLE]
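To make the idea of contrasting the two scores concrete, here is a minimal sketch. The exact contrastive formula is not recoverable from the text here, so the form used below, $(1 + \alpha)\,S - \alpha\,S_{\text{blind}}$, is an assumption borrowed from visual contrastive decoding; the function name `debiased_score` and the weight `alpha` are likewise hypothetical.

```python
def debiased_score(image_score: float, blind_score: float, alpha: float = 1.0) -> float:
    """Contrast the image-conditioned score against the text-only score.

    ASSUMPTION: the (1 + alpha) * a - alpha * b form is borrowed from
    contrastive decoding and is not confirmed as the paper's formula.
    """
    return (1.0 + alpha) * image_score - alpha * blind_score


# Two toy sentences with the same image-conditioned confidence (made-up numbers).
# The first is judged just as confidently without the image, so its confidence
# is explained by text priors; the second needs the image to be judged accurate.
prior_driven = debiased_score(image_score=3.0, blind_score=3.0)
image_driven = debiased_score(image_score=3.0, blind_score=0.5)
print(prior_driven, image_driven)
```

Under this assumed form, a sentence whose "yes" confidence does not change when the image is removed receives no boost, while a sentence whose confidence depends on the image is ranked higher.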
Guided Sentence Generation. In this approach, the generation process is guided by the debiased self-judgment scores to maintain alignment between the generated descriptions and the visual content. We adopt a sentence-by-sentence generation strategy, using debiased self-judgment scores to select each sentence in order to maintain fluency and faithfulness to the image. To minimize the cost of inference, we employ a greedy search strategy for sentence selection. At each step $t$, given the partially generated description $y_{<t}$, the model generates $N$ candidate sentences for the next sentence $y_t$. The candidate with the highest faithfulness score is selected as $y_t$ and appended to $y_{<t}$. This process continues until an EOS token is reached.
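The greedy sentence-level loop can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `propose` and `score` are hypothetical callables standing in for the LVLM's candidate sampler and the debiased self-judgment scorer, an empty candidate list stands in for reaching EOS, and the toy sentences and scores are made up.

```python
from typing import Callable, Dict, List


def guided_generation(
    propose: Callable[[List[str]], List[str]],
    score: Callable[[List[str], str], float],
    max_sentences: int = 20,
) -> List[str]:
    """Greedily build a description one sentence at a time."""
    description: List[str] = []
    for _ in range(max_sentences):
        candidates = propose(description)
        if not candidates:  # stand-in for the model emitting EOS
            break
        # Keep only the candidate with the highest score at this step.
        best = max(candidates, key=lambda sentence: score(description, sentence))
        description.append(best)
    return description


# Toy stand-ins for the model (hypothetical data).
TOY_CANDIDATES = [
    ["A cat sits on a sofa.", "A dog runs on a beach."],
    ["The sofa is red.", "The cat is flying."],
]
TOY_SCORES: Dict[str, float] = {
    "A cat sits on a sofa.": 2.0,
    "A dog runs on a beach.": 0.5,
    "The sofa is red.": 1.2,
    "The cat is flying.": -1.0,
}


def toy_propose(description: List[str]) -> List[str]:
    step = len(description)
    return TOY_CANDIDATES[step] if step < len(TOY_CANDIDATES) else []


def toy_score(description: List[str], sentence: str) -> float:
    return TOY_SCORES[sentence]


result = guided_generation(toy_propose, toy_score)
print(" ".join(result))
```

Because only one candidate survives each step, the cost grows linearly with the number of sentences rather than with the size of a beam.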
4.2 Self-Defense for Safety
This section presents an application of the debiased self-judgment score for detecting and moderating unsafe content in LVLMs’ responses, utilizing a fine-grained unsafe score and a safety prefix. This process (shown in Figure 3, bottom), referred to as Fine-Grained Self-Defense (FGSD), is composed of three key components:
Unsafety Scoring. To evaluate the safety of LVLMs’ responses more precisely, we adopt a sentence-level judgment and leverage the LVLM’s intrinsic ability for self-judgment to achieve higher accuracy while maintaining the model’s utility as much as possible. This section follows the methodology described in Section 4.1 to obtain the unsafe score. The Unsafety Scoring process uses a prompt, such as “Is the response unsafe?”, to calculate the unsafe score $S_{\text{unsafe}}$. Notably, in some cases, LVLMs cannot determine the safety of a response without visual input, as the assessment relies heavily on image context (see example in Appendix C.6), highlighting the need to mitigate text bias.
Unsafety Threshold Setting. When using the unsafe score to assess the safety of a sentence, it is important to set an appropriate threshold to distinguish between safe and unsafe sentences. This helps reduce unsafe outputs while maintaining the model’s utility. We first generate 1,000 safe responses using prompts from the model’s training dataset (or from the LLaVA-1.5 training set if the model’s training dataset is not publicly available). These safe responses are then scored at the sentence level using the method described in Unsafety Scoring. The final threshold is set as the maximum unsafe score observed among all verified safe sentences, rounded up to one decimal place. This adjustment provides a margin to prevent the model’s safe outputs from being misclassified as unsafe. The threshold can be formulated as follows:

$$\tau = \frac{1}{10} \left\lceil 10 \cdot \max_{s \in \mathcal{S}_{\text{safe}}} S_{\text{unsafe}}(s) \right\rceil,$$

where $\mathcal{S}_{\text{safe}}$ represents the sentences generated as safe responses from prompts sampled from general datasets, and $S_{\text{unsafe}}(s)$ is the unsafe score of sentence $s$. Here, $\lceil \cdot \rceil$ represents the ceiling function, which rounds a number up to the smallest integer greater than or equal to its value; scaling by 10 before and after applies this rounding at one decimal place.
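The threshold rule (maximum unsafe score over verified safe sentences, rounded up to one decimal place) can be sketched as below. The function name `unsafe_threshold` and the toy scores are hypothetical; `Decimal` is used here only to avoid binary floating-point surprises when rounding up, which is an implementation choice of this sketch.

```python
from decimal import Decimal, ROUND_CEILING


def unsafe_threshold(safe_sentence_scores) -> float:
    """Maximum unsafe score among safe sentences, rounded up to one decimal."""
    highest = Decimal(str(max(safe_sentence_scores)))
    return float(highest.quantize(Decimal("0.1"), rounding=ROUND_CEILING))


# Toy unsafe scores for sentences known to be safe (made-up numbers).
toy_safe_scores = [-1.3, 0.42, 0.17]
threshold = unsafe_threshold(toy_safe_scores)
print(threshold)
```

Rounding up rather than to the nearest value guarantees the threshold is never below any score seen on the safe calibration set, which is what keeps those sentences from being flagged.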
Unsafe Score-Guided Response Moderation. A sentence is considered to contain unsafe content if its unsafe score exceeds the threshold $\tau$. Upon detecting an unsafe output, the response is prefixed with "Sorry, answering the question will generate harmful content, because". This prefix, together with the original prompt, is then provided back to the LVLM, prompting it to generate the subsequent tokens. Leveraging its autoregressive architecture, the LVLM is able to autonomously produce a coherent explanation for the refusal.
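The moderation step can be sketched as a threshold check followed by a prefixed regeneration. In this illustration, `continue_from_prefix` is a hypothetical callable standing in for the LVLM continuing from the original prompt plus the safety prefix; the function names and toy inputs are assumptions.

```python
SAFETY_PREFIX = "Sorry, answering the question will generate harmful content, because"


def moderate_response(sentences, unsafe_scores, threshold, continue_from_prefix):
    """Return the response unchanged if safe, else a prefixed refusal."""
    if any(score > threshold for score in unsafe_scores):
        # Any sentence above the threshold marks the whole response as unsafe.
        return SAFETY_PREFIX + continue_from_prefix(SAFETY_PREFIX)
    return " ".join(sentences)


def toy_continue(prefix: str) -> str:
    """Hypothetical stand-in for the LVLM's continuation of the prefix."""
    return " the request asks for instructions that could cause harm."


safe_output = moderate_response(["A cat sits.", "It is calm."], [0.1, 0.2], 0.5, toy_continue)
unsafe_output = moderate_response(["Step one is easy."], [0.9], 0.5, toy_continue)
print(safe_output)
print(unsafe_output)
```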
4.3 Dual Self-Judgment for More Significant Self-Improvement
In this section, we present a self-rewarding training paradigm for LVLMs, referred to as Debiased Self-Rewarding (DSR). We propose a dual self-judgment mechanism for preference tuning (shown in Figure 5), which includes: (1) using the debiased self-judgment score as a reward signal for sentence-level preference data generation, and (2) refining instance-level preference data quality through self-judgment. This mechanism generates high-quality preference data, which is used to fine-tune the LVLM via Direct Preference Optimization (Rafailov et al., 2024) to achieve self-improvement. The method is described as follows:
Preference Data Generation. We generate two types of preference data for training: question answering and detailed description. Similar to the setup in Sec 4.1, at each step, the sentence with the highest debiased self-judgment score is selected as the preferred response, and the sentence with the lowest score as the dispreferred response. The process continues by generating new sentence candidates based on the selected sentences until the EOS token is reached.
Data Cleaning. We notice that the preferred data contains some incorrect responses, while the dispreferred data includes some correct ones, which could undermine the model’s performance during training. To resolve this, we use the same LVLM to evaluate the correctness of responses at the instance level. If the LVLM outputs “Yes”, the response is considered correct; otherwise, it is deemed incorrect. Consequently, incorrect responses in the preferred data and correct responses in the dispreferred data are removed. The final preference data is defined as $\mathcal{D} = \{(x, y_w, y_l)\}$, where $y_w$ and $y_l$ denote the preferred and dispreferred responses for the input prompt $x$.
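The two self-judgment stages can be sketched together: picking the highest- and lowest-scored candidate at a step, then filtering pairs with an instance-level check. This is one reading of the procedure, not the authors' code. In particular, dropping a whole pair when either side fails the check is an assumption, and `pick_pair`, `clean_pairs`, and `is_correct` are hypothetical names.

```python
def pick_pair(candidates, scores):
    """Return (preferred, dispreferred): highest- and lowest-scored candidates."""
    best = max(range(len(candidates)), key=lambda i: scores[i])
    worst = min(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], candidates[worst]


def clean_pairs(pairs, is_correct):
    """Keep pairs whose preferred side is judged correct and dispreferred is not.

    ASSUMPTION: a pair is dropped entirely if either side fails this check.
    """
    return [
        (prompt, preferred, dispreferred)
        for prompt, preferred, dispreferred in pairs
        if is_correct(prompt, preferred) and not is_correct(prompt, dispreferred)
    ]


# Toy data (made up); the set below stands in for the LVLM answering "Yes".
JUDGED_CORRECT = {"Two cats.", "A red bus."}


def toy_is_correct(prompt, response):
    return response in JUDGED_CORRECT


preferred, dispreferred = pick_pair(["Two cats.", "Three dogs."], [1.8, -0.4])
raw_pairs = [
    ("How many animals?", preferred, dispreferred),
    ("What vehicle is shown?", "A blue car.", "A red bus."),  # noisy pair
]
cleaned = clean_pairs(raw_pairs, toy_is_correct)
print(cleaned)
```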
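Preference Tuning. After obtaining the cleaned preference data, we fine-tune the target LVLM using DPO. The DPO loss is defined as:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right],$$

where $(x, y_w, y_l)$ is a prompt with its preferred and dispreferred responses drawn from the preference data $\mathcal{D}$, the model policy $\pi_\theta$ is initialized from the base reference policy $\pi_{\text{ref}}$, $\beta$ is a parameter controlling the deviation from $\pi_{\text{ref}}$, and $\sigma$ denotes the logistic function.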
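For a single preference pair, the DPO loss can be evaluated directly from sequence log-probabilities. The sketch below follows the standard DPO formulation; the function name `dpo_loss`, the default `beta=0.1`, and the toy log-probabilities are assumptions for illustration and are not values reported in the paper.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (preferred, dispreferred) pair.

    logp_* are log-probabilities under the policy being trained;
    ref_logp_* are log-probabilities under the frozen reference policy.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) equals softplus(-margin), computed stably here.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# At initialization the policy equals the reference, so the margin is zero.
loss_at_init = dpo_loss(-3.0, -3.0, -3.0, -3.0)

# After the policy shifts probability toward the preferred response.
loss_after_shift = dpo_loss(-1.0, -2.0, -1.5, -1.5, beta=1.0)
print(loss_at_init, loss_after_shift)
```

The loss starts at $\log 2$ and decreases as the policy raises the preferred response's likelihood relative to the reference more than it raises the dispreferred one's.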
5 Experiments
In this section, we evaluate the performance of the proposed debiased self-judgment score across various applications, aiming to answer the following questions: (1) Can DSGD effectively reduce hallucinations in LVLMs compared to other baselines? (2) Can FGSD reduce unsafe outputs while maintaining the utility of LVLMs? (3) Can DSR effectively enhance the comprehensive capabilities of LVLMs? (4) Are the self-judgment method and the debiasing method we designed truly effective?
5.1 Enhancing Faithfulness through DSGD
Experimental Settings. We evaluate our method’s performance on object hallucination using the CHAIR (Rohrbach et al., 2018) metric on the MSCOCO (Lin et al., 2014) dataset, while BLEU (Papineni et al., 2002) is used to assess overall generation quality. FaithScore (Jing et al., 2023) measures hallucinations involving objects, attributes, and relationships. For hallucination mitigation during inference, we test six methods: Dola (Chuang et al., 2023), VCD (Leng et al., 2024), Opera (Huang et al., 2024), LURE (Zhou et al., 2023), Woodpecker (Yin et al., 2023), and HALC (Chen et al., 2024b), along with two conventional decoding strategies—greedy decoding and beam search. The experiments are conducted on LLaVA-1.5 (Liu et al., 2023b), InstructBLIP (Dai et al., 2023), and mPLUG-Owl2 (Ye et al., 2023b). For DSGD, we set to and to . The self-judgment prompt is detailed in Appendix D, while baseline and benchmark details are in Appendix A.2 and Appendix A.3.
Results. The primary experimental results are summarized in Table 1. Our proposed DSGD method achieves state-of-the-art performance in hallucination mitigation during inference, significantly reducing object hallucinations with notable decreases in CHAIR scores (31.33% for LLaVA-1.5, 42.42% for InstructBLIP, and 47.63% for mPLUG-Owl2). In addition, DSGD improves BLEU scores, reflecting an overall improvement in captioning quality. Table 2 further reinforces these findings, showing that DSGD surpasses other methods across a comprehensive evaluation of hallucinations, including objects, attributes, and relationships. DSGD consistently delivers the best results on both FaithScore and Sentence-level FaithScore, underscoring its robustness in ensuring caption faithfulness.
5.2 Ensuring Safety via FGSD
Experimental Settings. To measure safety performance, we follow previous works by utilizing commonly employed subsets of the MM-SafetyBench (Liu et al., 2023d). To assess whether a method preserves the model’s original utility, we generate 1,000 safe responses using prompts from general datasets (see Appendix A.1.2 for details) and calculate the proportion of safe responses incorrectly classified as unsafe, reported as the Misclassification Rate (MCR). We use the same three LVLMs as described in the previous section. ECSO (Gou et al., 2024) is chosen as the baseline, due to its enhanced safety during the inference phase. For FGSD, we set to for all experiments. The prompt used for judging safety is provided in Appendix D.
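The Misclassification Rate is a simple proportion, sketched below. The function name is hypothetical, and the toy flags are constructed only to reproduce a 14.6% rate over 1,000 safe responses; the per-response flags themselves are illustrative, not data from the paper.

```python
def misclassification_rate(flagged_unsafe) -> float:
    """Percentage of known-safe responses that were flagged as unsafe."""
    return 100 * sum(flagged_unsafe) / len(flagged_unsafe)


# Illustrative flags: 146 of 1,000 safe responses wrongly flagged.
toy_flags = [True] * 146 + [False] * 854
mcr = misclassification_rate(toy_flags)
print(mcr)
```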
Results. The results in Table 3 show that FGSD consistently outperforms baseline methods across three models (LLaVA-1.5, InstructBLIP, and mPLUG-Owl2) on MM-SafetyBench. FGSD achieves a significantly lower attack success rate (ASR) compared to the baseline without defense, reducing ASR by 73.6% for LLaVA-1.5, 47.6% for InstructBLIP, and 70.8% for mPLUG-Owl2, highlighting substantial safety improvement across these models. Although ECSO improves safety relative to no defense, it is less effective than FGSD. For InstructBLIP, ECSO reports a high misclassification rate (MCR) of 14.6%, where many safe outputs are incorrectly flagged as unsafe, reducing the model’s practical utility. In contrast, FGSD achieves zero MCR across all models, maintaining both safety and utility without compromising output accuracy. These findings underscore FGSD’s superior ability to enhance the safety of LVLMs during inference without the loss of utility observed with ECSO.
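The reported ASR reductions are relative reductions of the average ASR, which can be checked from the Avg column of the safety results table (Vanilla versus FGSD). The helper name `relative_reduction` is hypothetical; the numbers are the averages listed in that table.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100 * (before - after) / before


# (Vanilla average ASR, FGSD average ASR) per model, taken from the table.
average_asr = {
    "LLaVA-1.5": (73.1, 19.3),
    "InstructBLIP": (49.2, 25.8),
    "mPLUG-Owl2": (84.6, 24.7),
}
reductions = {
    model: round(relative_reduction(before, after), 1)
    for model, (before, after) in average_asr.items()
}
print(reductions)
```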
5.3 Improving Overall Capability with DSR
Experimental Settings. To evaluate DSR in improving LVLM capability, we conduct experiments on three types of benchmarks: comprehensive benchmarks (MME (Fu et al., 2023), SEED-Bench (Li et al., 2023a), LLaVAW (Liu et al., 2023c), MMBench (Liu et al., 2024e), MM-Vet (Yu et al., 2024c)), general VQA tasks (ScienceQA (Lu et al., 2022), VisWiz (Gurari et al., 2018), GQA (Hudson and Manning, 2019)), and hallucination benchmarks (POPE (Li et al., 2023d), CHAIR (Rohrbach et al., 2018)). We utilize LLaVA-1.5 7B as the backbone model. For comparison, DSR is benchmarked against several data-driven preference learning methods, including Silkie (Li et al., 2023c), LLaVA-RLHF (Sun et al., 2023), POVID (Zhou et al., 2024a), RLHF-V (Yu et al., 2024a), and CSR (Zhou et al., 2024b). For DSR, we set to and to . More implementation details are provided in Appendix A.1.3.
Results. As shown in Table 4, DSR significantly outperforms existing preference data curation methods that rely on external resources by delivering a more accurate reward signal through debiased self-judgment. Similarly, CSR, which depends on CLIP to compute text-image similarity and employs a computationally expensive beam search algorithm, is also outperformed. Although these methods leverage powerful external resources, DSR achieves superior results to all baselines using only 6k training examples, the smallest dataset among all methods. In comparison, CSR uses 13k data points, further underscoring the high quality of the data generated by DSR and the effectiveness of debiased self-judgment. To ensure a fair comparison and verify the effectiveness of DSR, we restrict our training data to no more than that used by any baseline in the experiments corresponding to Table 4. To further verify the scalability of DSR, additional results from training with larger-scale data are provided in Appendix C.2. To verify the generalizability of DSR, we apply it to a more advanced model, VILA (Lin et al., 2024), with detailed results provided in Appendix C.3.
5.4 Ablation studies
We conduct an ablation study to assess the impact of Self-Judgment and Score Debiasing on hallucination rates, as measured by CHAIRS and CHAIRI, within our proposed Debiased Self-Guided Decoding (DSGD) method. The results, summarized in Table 5, indicate that when Self-Judgment is removed and candidates are selected randomly instead of being guided by the debiased self-judgment score, hallucination rates increase significantly. Similarly, when the Score Debiasing step is removed, which results in a higher reliance on text priors during the self-judgment process, the hallucination rates also rise. In contrast, the full DSGD approach, which integrates both Self-Judgment and Score Debiasing, achieves the lowest hallucination rates. These findings demonstrate the effectiveness of both components in ensuring more faithful image-grounded content generation. Further ablation studies on the effects of hyper-parameters in DSGD, along with the corresponding ablation results for FGSD and DSR, can be found in Appendix C.4 and Appendix C.5.
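The three ablation settings can be illustrated with a minimal sketch of candidate selection. It assumes, purely for illustration, that debiasing subtracts a text-only score from the image-conditioned score; the function name `pick_candidate`, the mode names, and the toy scores are hypothetical and are not taken from the paper's implementation or results.

```python
import random
from typing import Callable, Optional, Sequence

Scorer = Callable[[str], float]


def pick_candidate(
    candidates: Sequence[str],
    score_with_image: Scorer,
    score_text_only: Scorer,
    mode: str = "full",
    rng: Optional[random.Random] = None,
) -> str:
    """Choose one candidate sentence under a given ablation setting.

    mode="full":      rank by (image-conditioned score - text-only score).
                      The subtraction is an ASSUMED form of debiasing.
    mode="no_debias": rank by the image-conditioned score alone.
    mode="random":    ignore scores and pick uniformly at random.
    """
    if not candidates:
        raise ValueError("candidates must be non-empty")
    if mode == "random":
        return (rng or random.Random(0)).choice(list(candidates))
    if mode == "no_debias":
        return max(candidates, key=score_with_image)
    if mode == "full":
        return max(candidates, key=lambda c: score_with_image(c) - score_text_only(c))
    raise ValueError(f"unknown mode: {mode}")


# Toy scores (hypothetical): the second sentence is favoured by the text prior.
TOY_CANDIDATES = ["A dog runs on grass.", "A dog catches a frisbee."]
TOY_WITH_IMAGE = {"A dog runs on grass.": 0.6, "A dog catches a frisbee.": 0.8}
TOY_TEXT_ONLY = {"A dog runs on grass.": 0.1, "A dog catches a frisbee.": 0.7}

full_choice = pick_candidate(
    TOY_CANDIDATES, TOY_WITH_IMAGE.__getitem__, TOY_TEXT_ONLY.__getitem__, "full"
)
raw_choice = pick_candidate(
    TOY_CANDIDATES, TOY_WITH_IMAGE.__getitem__, TOY_TEXT_ONLY.__getitem__, "no_debias"
)
```

In this toy case the undebiased ranking prefers the sentence that the text prior already favours, while the debiased ranking prefers the sentence whose score depends more on the image.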
6 Conclusion
In this paper, we propose a novel self-alignment method to solve the alignment problems in Large Vision-Language Models. By using a debiased self-judgment score, our approach enables the model to improve its vision-language alignment on its own, eliminating the need for external data or human intervention. Our extensive experiments demonstrate that this method reduces hallucinations and makes LVLMs safer and more powerful. The promising experimental results of our method indicate that self-judgment has considerable potential for enhancing alignment in LVLMs.
Limitations
In this work, we propose a debiased self-judgment score that guides both the decoding process and self-improvement training, enhancing the faithfulness and safety of LVLMs’ outputs, while also driving comprehensive improvements in their overall capabilities. However, our work still has limitations. Firstly, our method relies on accessing the model’s predicted token logits, which are often inaccessible in many closed-source models. This restricts its applicability to more powerful LLMs, such as GPT-4, which do not provide token likelihoods. Secondly, due to computational limitations, we only experimented with common LVLMs. Further experiments on a broader range of models are required to validate the effectiveness and generalizability of our approach. Thirdly, in the jailbreak attack experiments, we conducted tests solely in English, so we cannot guarantee the effectiveness of our method for other languages.
Ethical Considerations
In this work, we present a novel approach to improving the alignment of Large Visual-Language Models (LVLMs) using a debiased self-judgment score. While our method enhances faithfulness, safety, and overall performance, it is essential to address the ethical implications of our research to ensure the responsible development and deployment of LVLMs.
Mitigating Harmful Outputs A primary objective of our approach is to enhance the safety of LVLMs by reducing hallucinations and ensuring that generated outputs are grounded in visual inputs. This reduces the likelihood of disseminating inaccurate or misleading information. Furthermore, the Fine-Grained Self-Defense (FGSD) mechanism is specifically designed to detect and moderate unsafe content, thereby minimizing the risk of generating harmful, unethical, or illegal outputs.
However, despite these advancements, there are scenarios where the model may fail to identify or mitigate unsafe outputs, particularly in cases involving nuanced ethical dilemmas or adversarial attacks. Strengthening the robustness of safety mechanisms across diverse and complex scenarios remains an ongoing challenge that requires further exploration.
Bias and Fairness While our debiasing techniques address text modality bias in LVLMs, other forms of bias inherent in the training data or model architecture may still persist. These biases could result in unintended consequences, such as reinforcing stereotypes or generating outputs that disproportionately impact certain groups. Future research should focus on identifying, evaluating, and mitigating broader societal biases in both the data and model architectures to ensure fair and equitable behavior in LVLMs across various contexts.
Human Oversight and Accountability Our method reduces reliance on external datasets, human annotations, and judgment models, which improves scalability and efficiency. However, this raises concerns about the potential lack of human oversight. While the self-judgment capabilities of the model show promise, they may not always align with human ethical standards, especially in sensitive or high-stakes applications.
We believe that human oversight and intervention should remain integral to the deployment of LVLMs, particularly in critical domains such as healthcare, law, and education. Ensuring alignment with human ethical principles and maintaining accountability throughout the lifecycle of these systems is essential for their safe and responsible use.
Acknowledgements
We sincerely appreciate the reviewers and the AC for their valuable suggestions throughout the review process.
Appendix A Experimental Details
A.1 Implementation Details
A.1.1 Enhancing Faithfulness through DSGD
Sentence-Level Beam Search. We set the parameters as follows to balance both diversity and quality in the sampled data. The num_beams parameter is set to 5. Additionally, the num_token_beams is also configured to 5, ensuring that 5 token-level search results are returned per beam search. The eos_token_id is set to the token corresponding to a period (.), enabling sentence-by-sentence control of the generation process. Finally, is set to 1.
To increase data diversity, we implement group beam search by setting the num_beam_group parameter to 5. This technique, combined with token-level search, significantly enhances the diversity of the sampled data. Furthermore, we adjust the diversity_penalty parameter to 3.0, which regulates both diversity and quality among the different beam groups.
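The sentence-by-sentence control described above can be sketched as a simple loop. This is a simplified illustration under stated assumptions, not the authors' implementation: `propose` stands in for the model's group beam search (which would return candidate sentences ending at the period token), `score` stands in for the self-judgment score, and both, along with the toy data, are hypothetical.

```python
from typing import Callable, Dict, List

Propose = Callable[[str], List[str]]
Score = Callable[[str, str], float]


def sentence_level_search(propose: Propose, score: Score, max_sentences: int = 8) -> str:
    """Build a response one sentence at a time.

    propose(prefix) returns candidate next sentences; an empty list stops generation.
    score(prefix, candidate) returns a float where higher means better.
    """
    chosen: List[str] = []
    for _ in range(max_sentences):
        prefix = " ".join(chosen)
        candidates = propose(prefix)
        if not candidates:
            break
        # Keep only the best-scoring sentence before continuing.
        chosen.append(max(candidates, key=lambda c: score(prefix, c)))
    return " ".join(chosen)


# Toy stand-ins (hypothetical): a fixed script of candidates per step.
TOY_SCRIPT: List[List[str]] = [
    ["A dog runs on grass.", "A dog catches a frisbee."],
    ["The sky is clear.", "A man stands nearby."],
]
TOY_SENTENCE_SCORES: Dict[str, float] = {
    "A dog runs on grass.": 0.9,
    "A dog catches a frisbee.": 0.2,
    "The sky is clear.": 0.7,
    "A man stands nearby.": 0.3,
}


def toy_propose(prefix: str) -> List[str]:
    step = prefix.count(".")  # number of sentences generated so far
    return TOY_SCRIPT[step] if step < len(TOY_SCRIPT) else []


def toy_score(prefix: str, candidate: str) -> float:
    return TOY_SENTENCE_SCORES[candidate]


toy_response = sentence_level_search(toy_propose, toy_score)
```

Selecting at sentence boundaries keeps the number of scoring calls proportional to the number of sentences rather than the number of tokens, which is one plausible reason for using the period as the end-of-sequence token.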
A.1.2 Ensuring Safety via FGSD
In FGSD, is set to 0.1. As described in Equation 4, we sample 1000 questions from the models’ training datasets and calculate the unsafe score for LLaVA-1.5, InstructBLIP, and mPLUG-Owl2, setting the thresholds at 23, 22.4, and 14.9, respectively. The statistical results are shown in Figures 6, 7, and 8. To calculate MCR, we sample data from MSCOCO (Lin et al., 2014), ShareGPT-4V (Chen et al., 2023), MovieNet (Huang et al., 2020), Google Landmark v2 (Weyand et al., 2020), VQA v2 (Goyal et al., 2017), OKVQA (Marino et al., 2019), and TextVQA (Singh et al., 2019).
A.1.3 Improving Overall Capability with DSR
The hyperparameters for generating the data are the same as those for DSGD. The training hyperparameters are listed in Table 6. The model was trained for 1 epoch, which took 6 hours on a single A100 80GB GPU.
A.2 Overview of Baselines
We evaluate our approach against several established decoding methods, including greedy decoding, nucleus sampling, Beam Search, DoLa (Chuang et al., 2023), visual contrastive decoding (VCD) (Leng et al., 2023), HALC (Chen et al., 2024b), LURE (Zhou et al., 2023), Woodpecker (Yin et al., 2023), and OPERA (Huang et al., 2023). Greedy decoding deterministically selects the highest-probability token at each step, while Beam Search extends this by exploring multiple high-probability sequences simultaneously. Nucleus sampling focuses on sampling from the top portion of the probability distribution. DoLa contrasts logits from different layers to mitigate hallucinations in LLMs. OPERA combats hallucinations by introducing an over-trust penalty and using a retrospection-allocation mechanism to reduce dependence on limited summary tokens. VCD, specifically designed for vision-language models, reduces object hallucinations by contrasting outputs from original and modified images. HALC is a decoding strategy that reduces object hallucinations by using an adaptive focal-contrast grounding mechanism to correct hallucinating tokens and a matching-based beam search to balance hallucination mitigation with text generation quality. LURE and Woodpecker respectively use MiniGPT-4 and GPT-3.5 to modify the hallucination-containing outputs of the models.
A.3 Evaluation Metrics and Benchmarks
In our experiments, we use tasks such as visual question answering Fu et al. (2024); Yang et al. (2025b); Zhao and Zhang (2024); Cao and Zhao (2025) and image captioning.
- MME Fu et al. (2024) offers a robust benchmark for evaluating LVLMs across multimodal tasks. It assesses models on two major fronts: perception and cognition, using 14 well-structured subtasks that challenge their interpretive and analytical abilities.
- SEED-Bench Li et al. (2023b) focuses on measuring the generative comprehension of LVLMs. It includes a large dataset of 19K multiple-choice questions, complete with human annotations, spanning 12 different evaluation dimensions to test both spatial and temporal reasoning across images and videos.
- LLaVAW Liu et al. (2023c) provides a targeted evaluation for visual reasoning models. It features 24 diverse images paired with 60 questions, covering a variety of scenarios, including indoor, outdoor, and abstract settings.
- MMBench Liu et al. (2024d) takes a two-pronged approach by introducing an extensive dataset that broadens the scope of evaluation questions and a novel CircularEval strategy that utilizes ChatGPT to convert free-form responses into structured answer choices.
- MM-Vet Yu et al. (2023b) is designed to assess LVLMs through a wide range of multimodal tasks, structured into 16 distinct integrations based on 6 core vision-language capabilities, providing a detailed performance analysis across different question types and answer formats.
- ScienceQA Lu et al. (2022) focuses on evaluating multi-hop reasoning and interpretability within scientific domains. It features a large dataset of approximately 21K multiple-choice questions across a variety of science topics, accompanied by detailed annotations and explanations.
- VizWiz Gurari et al. (2018) stands out in the VQA field by using a dataset of over 31,000 visual questions that come from a real-world setting, featuring images taken by visually impaired individuals and their associated spoken queries, along with crowdsourced answers.
- GQA Hudson and Manning (2019) is built for complex visual reasoning tasks, containing 22 million questions generated from scene graph-based structures. It incorporates innovative evaluation metrics focused on consistency, grounding, and plausibility, pushing the boundaries of vision-language evaluation.
- POPE Li et al. (2023d) introduces a methodology to evaluate object hallucination in LVLMs, transforming the task into a binary classification problem. By using simple Yes-or-No prompts, POPE highlights model tendencies towards hallucination through various object sampling strategies.
- CHAIR Rohrbach et al. (2019) is a widely used metric for assessing object hallucination in image captioning. It includes two variants: $\mathrm{CHAIR}_I$, which evaluates object hallucination at the instance level, and $\mathrm{CHAIR}_S$, which does so at the sentence level. Both are defined as:
  $$\mathrm{CHAIR}_I = \frac{|\{\text{hallucinated objects}\}|}{|\{\text{all mentioned objects}\}|}$$
  $$\mathrm{CHAIR}_S = \frac{|\{\text{captions with hallucinated objects}\}|}{|\{\text{all captions}\}|}$$
  For our evaluation, we randomly sampled 500 images from the COCO Lin et al. (2014) validation set and applied the CHAIR metric to measure hallucinations.
- MM-SafetyBench Liu et al. (2024c) is a comprehensive safety evaluation framework for Multimodal Large Language Models (MLLMs). The benchmark targets models’ vulnerabilities to visual prompt attacks, particularly those triggered by harmful query-relevant images. It consists of 13 different scenarios (e.g., illegal activity, hate speech, physical harm), represented by 5,040 text-image pairs, to assess how well MLLMs can avoid producing unsafe responses. Experimental results show that many MLLMs, including state-of-the-art models like LLaVA-1.5, are highly susceptible to attacks, especially when prompted with query-relevant images. MM-SafetyBench helps quantify these risks and provides insights into improving the safety protocols of MLLMs.
- FaithScore Jing et al. (2024) is a reference-free, fine-grained evaluation metric designed to measure the faithfulness of free-form answers generated by large vision-language models (LVLMs). FaithScore evaluates the consistency between descriptive sub-sentences in the generated answers and the input images. The process involves three steps: (1) identifying descriptive sub-sentences, (2) extracting atomic facts from these sub-sentences, and (3) verifying these facts against the input image. FaithScore has shown a strong correlation with human judgments on faithfulness, providing a more interpretable and fine-grained evaluation compared to existing metrics.
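For the CHAIR metric in the list above, a minimal sketch of the two variants may help. It follows the standard definitions: the instance-level score is the fraction of mentioned objects that are not in the image, and the sentence-level score is the fraction of captions containing at least one such object. The sketch assumes objects have already been extracted from captions and normalized to a shared vocabulary; the function name `chair_scores` and the toy data are hypothetical.

```python
from typing import List, Set, Tuple


def chair_scores(
    mentioned_per_caption: List[Set[str]],
    truth_per_image: List[Set[str]],
) -> Tuple[float, float]:
    """Return (instance-level, sentence-level) hallucination rates.

    mentioned_per_caption[i] holds the objects named in caption i.
    truth_per_image[i] holds the objects actually present in image i.
    """
    if len(mentioned_per_caption) != len(truth_per_image):
        raise ValueError("need one ground-truth set per caption")

    total_mentions = 0
    hallucinated_mentions = 0
    hallucinated_captions = 0
    for mentioned, truth in zip(mentioned_per_caption, truth_per_image):
        hallucinated = mentioned - truth  # objects named but not in the image
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        if hallucinated:
            hallucinated_captions += 1

    n_captions = len(mentioned_per_caption)
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = hallucinated_captions / n_captions if n_captions else 0.0
    return chair_i, chair_s


# Toy example (hypothetical data): "frisbee" and "tree" are hallucinated.
toy_mentions = [{"dog", "frisbee"}, {"cat"}, {"car", "person", "tree"}]
toy_truth = [{"dog"}, {"cat", "sofa"}, {"car", "person"}]
toy_chair_i, toy_chair_s = chair_scores(toy_mentions, toy_truth)
```

Here 2 of 6 mentioned objects are hallucinated and 2 of 3 captions contain a hallucination, so the instance-level rate is 1/3 and the sentence-level rate is 2/3.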
Appendix B Efficiency Analysis
Large models face efficiency challenges Yang et al. (2025a); Hu et al. (2025); Li et al. (2025). We present a comparison of time efficiency between DSGD and other approaches in Table 7.
Appendix C More Results
C.1 Quantitative Analysis of Self-Judgment Score
We extend our investigation beyond the LLaVA-1.5-7B model by including results for InstructBLIP and mPLUG-Owl2, as detailed in Table 8. Here, Spearman’s rank correlation coefficient measures how strongly two variables increase or decrease together. It ranges from -1 to 1: positive values indicate a positive correlation, negative values indicate a negative correlation, and larger absolute values indicate a stronger monotonic relationship. These additional analyses further confirm the existence of bias toward the textual modality in the self-judgment of LVLMs.
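As a reference for how such a coefficient is computed, the following is a minimal pure-Python sketch of Spearman's rank correlation: both variables are converted to ranks (ties share their average rank) and the Pearson correlation of the ranks is returned. The function names and the toy inputs are illustrative and are not the paper's data.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks matter, any strictly increasing relationship yields a coefficient of 1 and any strictly decreasing one yields -1, regardless of whether the relationship is linear.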
C.2 Scalability Study of DSR with Larger-Scale Preference Data
We further investigate the scalability of DSR by increasing the amount of preference data used for training. Specifically, we compare the performance of DSR when trained with 6K and 10K preference data, alongside the original LLaVA-1.5-7B baseline. As shown in Table 9, increasing the training data from 6K to 10K leads to consistent improvements across most benchmarks. Notably, DSR achieves the best or tied-best results on all metrics when scaled to 10K data, demonstrating its strong scalability and effectiveness. These findings indicate that DSR can effectively leverage larger-scale preference data to further enhance the overall capability of LVLMs.
C.3 VILA Experiments with DSR
To evaluate the generalizability of DSR, we applied it to the advanced VILA (Lin et al., 2024) model across various benchmarks. Table 10 presents the experimental results of VILA combined with different preference data curation methods: the baseline VILA, VILA+CSR, and VILA+DSR.
C.4 Settings of Hyper-parameters
Further ablation studies on the effects of hyper-parameters are presented in Figures 9, 10, 11 and Table 11. Figure 9 illustrates the effect of the number of beams in DSGD. Figure 10 illustrates the effect of diversity_penalty in DSGD. Figure 11 illustrates the effect of in DSGD. Table 11 illustrates the effect of in FGSD.
C.5 Ablation Studies
The ablation study results for FGSD and DSR can be found in Table 12 and Table 13.
C.6 Case Studies
Figure 12 presents a case where our approach enhances faithfulness. Figure 13 illustrates how our method safely prevents an attack, while Figure 14 demonstrates that the model cannot assess the safety of the response without image input.
Appendix D Prompt Design
The detailed prompt designs for each task are shown in Tables 14, 15, 16, 17, and 18.
References
- Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
- Bai et al. (2023) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxu
- Cao and Zhao (2025) Linbo Cao and Jinman Zhao. 2025. Pretraining on the test set is no longer all you need: A debate-driven approach to qa benchmarks. Preprint, arXiv:2507.17747.
- Chan et al. (2023) Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2023. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201.
- Chang et al. (2024) Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. 2024. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3):1–45.
- Chen et al. (2023) Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. 2023. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793.
- Chen et al. (2024a) Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, and Ajay Divakaran. 2024a. Dress: Instructing large vision-language models to align and interact with humans via natural language feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14239–14250.
- Chen et al. (2024b) Zhaorun Chen, Zhuokai Zhao, Hongyin Luo, Huaxiu Yao, Bo Li, and Jiawei Zhou. 2024b. Halc: Object hallucination reduction via adaptive focal-contrast decoding. arXiv preprint arXiv:2403.00425.
