ExpAlign: Expectation-Guided Vision-Language Alignment for Open-Vocabulary Grounding
Junyi Hu, Tian Bai, Fengyi Wu, Wenyan Li, Zhenming Peng, Yi Zhang

TL;DR
ExpAlign is a theoretically grounded, expectation-guided vision-language alignment framework that improves open-vocabulary detection and segmentation without explicit token-level supervision or additional annotations, reaching 36.2 AP_r on LVIS minival.
Contribution
The paper presents ExpAlign, an alignment framework that combines a multiple instance learning (MIL) formulation with energy-based multi-scale consistency regularization to improve weakly supervised vision-language grounding.
Findings
Achieves 36.2 AP_r on LVIS minival, outperforming comparable methods.
Improves open-vocabulary detection and zero-shot segmentation, especially on long-tail categories.
Remains lightweight and inference-efficient.
Abstract
Open-vocabulary grounding requires accurate vision-language alignment under weak supervision, yet existing methods either rely on global sentence embeddings that lack fine-grained expressiveness or introduce token-level alignment with explicit supervision or heavy cross-attention designs. We propose ExpAlign, a theoretically grounded vision-language alignment framework built on a principled multiple instance learning formulation. ExpAlign introduces an Expectation Alignment Head that performs attention-based soft MIL pooling over token-region similarities, enabling implicit token and instance selection without additional annotations. To further stabilize alignment learning, we develop an energy-based multi-scale consistency regularization scheme, including a Top-K multi-positive contrastive objective and a Geometry-Aware Consistency Objective derived from a Lagrangian-constrained…
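The attention-based soft MIL pooling mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`soft_mil_pool`, `expectation_alignment_score`), the cosine similarity, the temperature values, and the order of pooling (tokens first, then regions) are all assumptions made for illustration. The idea shown is only that a softmax over similarities acts as attention weights, and the pooled score is the expectation of the similarities under those weights.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def soft_mil_pool(sim, temperature=1.0, axis=-1):
    """Attention-weighted expectation of `sim` along `axis`.

    Small temperatures approach max pooling (hard instance selection);
    large temperatures approach mean pooling.
    """
    weights = softmax(sim / temperature, axis=axis)
    return np.sum(weights * sim, axis=axis)


def expectation_alignment_score(token_feats, region_feats, temperature=0.1):
    """Hypothetical helper: pool token-region similarities into one score.

    token_feats:  (T, D) text token embeddings
    region_feats: (R, D) visual region embeddings
    Returns per-region scores of shape (R,) and a scalar score.
    """
    # Assumption: cosine similarity between L2-normalized features.
    t = token_feats / np.linalg.norm(token_feats, axis=-1, keepdims=True)
    r = region_feats / np.linalg.norm(region_feats, axis=-1, keepdims=True)
    sim = t @ r.T  # (T, R) token-region similarity matrix

    # Implicit token selection: soft-pool over tokens for each region.
    per_region = soft_mil_pool(sim, temperature, axis=0)  # (R,)
    # Implicit instance selection: soft-pool over regions.
    score = soft_mil_pool(per_region, temperature, axis=0)  # scalar
    return per_region, score


rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))   # 4 tokens, 8-dim embeddings
regions = rng.normal(size=(6, 8))  # 6 regions, 8-dim embeddings
per_region, score = expectation_alignment_score(tokens, regions)
```

Because the pooling weights come from the similarities themselves, no token-level or region-level labels are needed to decide which tokens and regions contribute most; the temperature controls how sharply the pooling concentrates on the best-matching instances.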
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
