PostAlign: Multimodal Grounding as a Corrective Lens for MLLMs

Yixuan Wu; Yang Zhang; Jian Wu; Philip Torr; Jindong Gu

arXiv:2506.17901·cs.CV·June 24, 2025


3 Reviews

TL;DR

PostAlign is a post-processing framework for multimodal large language models that improves visual grounding accuracy and reduces hallucinations by incorporating multimodal grounding modules and selective reasoning mechanisms.

Contribution

It introduces MMGrounded-PostAlign, a novel post-multimodal alignment framework with grounding modules and rejection mechanisms to enhance visual understanding and mitigate hallucinations in MLLMs.

Findings

01

Significant improvements on the POPE, HaloQuest, and VQAv2 benchmarks.

02

Effective reduction of hallucinations in multimodal models.

03

Enhanced fine-grained visual grounding capabilities.

Abstract

Multimodal Large Language Models (MLLMs) excel in vision-language tasks, such as image captioning and visual question answering. However, they often suffer from over-reliance on spurious correlations, primarily due to linguistic priors that distract the model from leveraging actual visual information. To address these issues, we introduce MMGrounded-PostAlign, a post-multimodal alignment framework designed to enhance the visual understanding capabilities and mitigate the hallucinations of MLLMs. Our framework incorporates a multimodal grounding module for both visual grounding, which identifies the referred object in the image, and textual grounding, which generates the rationale for the final answer, ensuring that outputs are anchored in both visual and textual evidence. To mitigate the hallucinations, we introduce a negative rejection mechanism in the visual grounding module to…
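The control flow the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the token spellings `<LOC>`, `<REJ>`, `<SIMPLE>`, and `<COMPLEX>` are assumed from the names used in the reviews on this page, and `Routed` and `route` are hypothetical helpers. What the framework actually does after a rejection is cut off in the excerpt above, so the sketch only records the rejection and skips localization.

```python
from dataclasses import dataclass
from typing import Sequence

# Assumed token spellings; the page only names LOC/REJ and <SIMPLE>/<COMPLEX>.
LOC, REJ = "<LOC>", "<REJ>"
SIMPLE, COMPLEX = "<SIMPLE>", "<COMPLEX>"


@dataclass
class Routed:
    localize: bool   # run visual grounding for the referred object
    rejected: bool   # negative rejection fired (referent judged absent)
    rationale: bool  # generate a textual rationale before the answer


def route(tokens: Sequence[str]) -> Routed:
    """Hypothetical routing of special tokens found in a model output."""
    rejected = REJ in tokens
    return Routed(
        # Localize only when a LOC token is present and nothing was rejected.
        localize=(LOC in tokens) and not rejected,
        rejected=rejected,
        # Selective reasoning: only COMPLEX queries request a rationale.
        rationale=COMPLEX in tokens,
    )
```

Under these assumptions, `route(["<LOC>", "<SIMPLE>"])` requests localization without a rationale, while `route(["<REJ>", "<SIMPLE>"])` skips localization entirely.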

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

- Clear motivation: Tackles language-prior-driven hallucination via explicit multimodal grounding; neat idea of "grounding as a corrective lens".
- Practical design: Simple LOC/REJ interface to a multi-task decoder; selective reasoning avoids unnecessary rationale generation.
- Broad evaluation: Covers hallucination, general V+L, and grounding benchmarks with meaningful ablations.

Weaknesses

- Limited generality: Only tested on LLaVA-1.5 (7B/13B) + SAM-ViT-H; cross-backbone evidence (e.g., Qwen-VL, other grounding encoders) is missing.
- The idea of labeling queries as SIMPLE vs COMPLEX is good, but doing so at the dataset level rather than per sample raises a concern. Some "simple" dataset queries might still require reasoning, and vice versa.
- The REJ token is an interesting idea, but the paper does not show cases where the referent exists but the system rejects it.
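To make the dataset-level labeling concern concrete, here is a small sketch. The dataset names follow the example given by Reviewer 03 (COCO as `<SIMPLE>`, ReasonSeg as `<COMPLEX>`); the mapping `DATASET_TAG` and the function `dataset_level_tag` are hypothetical and are not taken from the paper.

```python
# Hypothetical dataset-to-tag mapping; names follow Reviewer 03's example.
DATASET_TAG = {"COCO": "<SIMPLE>", "ReasonSeg": "<COMPLEX>"}


def dataset_level_tag(dataset: str, query: str) -> str:
    """Return the tag for a query based only on its source dataset.

    The query text is deliberately ignored, which is exactly the
    behavior the reviewers flag: every sample inherits its dataset's tag.
    """
    return DATASET_TAG[dataset]
```

With this scheme, a reasoning-heavy query drawn from COCO would still be tagged `<SIMPLE>`, and a trivial query from ReasonSeg would still be tagged `<COMPLEX>`.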

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper clearly identifies and addresses a critical problem in MLLMs: the over-reliance on linguistic priors, which leads to hallucinations and a failure to ground responses in visual evidence. The proposed "post-alignment" framework is a well-motivated and logical approach to re-center the model's outputs on visual information.
2. The experimental analysis provides valuable insights. "Finding 1" (Figure 3), which empirically demonstrates how linguistic priors can override visual informati

Weaknesses

1. Insufficient baseline comparisons: A significant weakness is the lack of comparison against the original, unmodified baseline model, as well as other well-established MLLMs. The model is built on LLaVA-1.5, yet Tables 1-3 primarily compare variants of the proposed method against an internal baseline (the framework with modules removed), not against the original LLaVA-1.5. This makes it difficult to assess the true impact (including any potential performance trade-offs) of the added components

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- It accurately identifies two key issues of Multimodal Large Language Models (MLLMs): "hallucinations" (generating non-existent content) and "insufficient fine-grained visual understanding", which are caused by the models' over-reliance on linguistic priors. These two types of issues serve as core bottlenecks that undermine the robustness and reliability of current MLLMs in vision-language tasks, making the research direction highly practically significant and necessary.
- While enhancing visua

Weaknesses

- In textual grounding, the <SIMPLE>/<COMPLEX> labels are categorized at the "dataset level" (e.g., queries in the COCO dataset are classified as <SIMPLE>, while those in the ReasonSeg dataset are classified as <COMPLEX>), rather than being annotated at the "sample level". Although this approach reduces annotation costs and ensures training stability, it fails to handle scenarios where "simple queries and complex queries are mixed" within the same dataset. This may lead to inaccurate matching of
