PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?

Mennatullah Siam

arXiv:2502.04192 · cs.CV · January 27, 2026

TL;DR

This paper critically evaluates pixel-level vision foundation models, revealing their limitations in visual question answering and grounding, and introduces new benchmarks, analysis tools, and insights into grounding emergence.

Contribution

It introduces paired benchmarks for VQA and grounding, analyzes grounding emergence, and provides interpretability tools for MLLMs, challenging current training paradigms.

Findings

01 Simple baselines outperform some pixel-level MLLMs

02 Grounding can relate to object parts, location, or context

03 Grounding does not always match referring expressions

Abstract

Multiple works have emerged to push the boundaries of multi-modal large language models (MLLMs) towards pixel-level understanding. The current trend is to train MLLMs with pixel-level grounding supervision, in the form of masks, on large-scale labelled data, using specialized decoders for the segmentation task. However, we show that such MLLMs, when evaluated on recent challenging vision-centric benchmarks, exhibit a weak ability in visual question answering (VQA). Surprisingly, some of these methods even degrade the grounding ability of MLLMs that were never trained with such pixel-level supervision. In this work, we propose two novel, challenging benchmarks with paired evaluation for both VQA and grounding. We demonstrate that simple baselines that are not unified achieve performance that matches or surpasses that of some pixel-level MLLMs. Our paired benchmarks and evaluation enable…
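
As a rough illustration of what "paired evaluation" could look like in practice, the sketch below scores the same set of samples on both tasks: answer accuracy for VQA and mask intersection-over-union (IoU) for grounding. This is only one reading of the abstract, not the paper's evaluation protocol. The `PairedSample` type, the `paired_scores` function, the exact-match answer comparison, and the convention that two empty masks have an IoU of 1.0 are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

# Assumption: a binary mask is a nested sequence of 0/1 values (rows of pixels).
Mask = Sequence[Sequence[int]]


@dataclass
class PairedSample:
    """Hypothetical record holding both task outputs for one image/question."""
    predicted_answer: str
    correct_answer: str
    predicted_mask: Mask
    reference_mask: Mask


def mask_iou(pred: Mask, ref: Mask) -> float:
    """Intersection-over-union of two binary masks of the same shape."""
    intersection = 0
    union = 0
    for pred_row, ref_row in zip(pred, ref):
        for p, r in zip(pred_row, ref_row):
            intersection += 1 if (p and r) else 0
            union += 1 if (p or r) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def paired_scores(samples: List[PairedSample]) -> Dict[str, float]:
    """Report VQA accuracy and mean IoU over the same samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    # Assumption: answers are compared by case-insensitive exact match.
    correct = [
        s.predicted_answer.strip().lower() == s.correct_answer.strip().lower()
        for s in samples
    ]
    ious = [mask_iou(s.predicted_mask, s.reference_mask) for s in samples]
    return {
        "vqa_accuracy": sum(correct) / len(samples),
        "mean_iou": sum(ious) / len(samples),
    }


if __name__ == "__main__":
    demo = [
        PairedSample("A", "a", [[1, 1], [0, 0]], [[1, 0], [0, 0]]),
        PairedSample("B", "C", [[0, 0], [0, 1]], [[0, 0], [0, 1]]),
    ]
    print(paired_scores(demo))
```

Scoring both tasks on identical samples makes it possible to see whether a model that answers correctly also localizes the relevant region, which separate VQA and grounding benchmarks cannot show.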

Code & Models

Repositories

msiam/pixfoundation (PyTorch, official)

Models

linhuixiao/Awesome-Visual-Grounding

Taxonomy

Topics: 3D Surveying and Cultural Heritage