ExpAlign: Expectation-Guided Vision-Language Alignment for Open-Vocabulary Grounding

Junyi Hu; Tian Bai; Fengyi Wu; Wenyan Li; Zhenming Peng; Yi Zhang

arXiv:2601.22666 · cs.CV · February 2, 2026

TL;DR

ExpAlign is an expectation-guided vision-language alignment framework built on a multiple instance learning formulation. It improves open-vocabulary detection and zero-shot segmentation without additional annotations, reaching 36.2 AP_r on LVIS minival.

Contribution

The paper presents ExpAlign, an alignment framework that combines a multiple instance learning formulation with energy-based multi-scale consistency regularization to improve vision-language alignment under weak supervision.

Findings

01. Achieves 36.2 AP_r on LVIS minival, outperforming comparable methods.

02. Improves open-vocabulary detection and zero-shot segmentation, especially on long-tail categories.

03. Remains lightweight and inference-efficient.

Abstract

Open-vocabulary grounding requires accurate vision-language alignment under weak supervision, yet existing methods either rely on global sentence embeddings that lack fine-grained expressiveness or introduce token-level alignment with explicit supervision or heavy cross-attention designs. We propose ExpAlign, a theoretically grounded vision-language alignment framework built on a principled multiple instance learning formulation. ExpAlign introduces an Expectation Alignment Head that performs attention-based soft MIL pooling over token-region similarities, enabling implicit token and instance selection without additional annotations. To further stabilize alignment learning, we develop an energy-based multi-scale consistency regularization scheme, including a Top-K multi-positive contrastive objective and a Geometry-Aware Consistency Objective derived from a Lagrangian-constrained…
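The attention-based soft MIL pooling mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`softmax`, `cosine`, `soft_mil_pool`), the use of cosine similarity, the `temperature` parameter, and the choice to derive attention weights directly from the similarities (rather than from a learned attention module) are all assumptions made for illustration. The idea shown is only that each region's score is an expectation of token-region similarities under a soft attention distribution over tokens, so informative tokens are selected implicitly.

```python
import math


def softmax(xs, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def soft_mil_pool(token_embs, region_embs, temperature=0.1):
    """Return one pooled alignment score per region.

    Hypothetical sketch: for each region, the text tokens form the "bag".
    Attention weights over tokens are a softmax of the token-region
    similarities (an assumption), and the pooled score is the
    attention-weighted expectation of those similarities.
    """
    scores = []
    for region in region_embs:
        sims = [cosine(token, region) for token in token_embs]
        weights = softmax(sims, temperature)
        scores.append(sum(w * s for w, s in zip(weights, sims)))
    return scores


# Toy inputs: two token embeddings and two region embeddings in 2-D.
tokens = [[1.0, 0.0], [0.0, 1.0]]
regions = [[1.0, 0.0], [1.0, 1.0]]
scores = soft_mil_pool(tokens, regions)
```

In this sketch a small `temperature` pushes the pooled score toward the maximum token-region similarity (hard selection), while a large one pushes it toward the plain average over tokens.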

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques