The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding
Jiayun Luo, Mir Rayat Imtiaz Hossain, Pritam Sarkar, Boyang Li, Leonid Sigal

TL;DR
This paper introduces CompART, a training method that improves multi-object visual grounding and understanding in vision-language models by decomposing captions and aligning attention, without extra annotations.
Contribution
CompART is a novel training approach that enhances multi-object grounding and visual understanding in VLMs through attention regularization and caption decomposition.
Findings
CompART improves grounding accuracy for multi-object references across various models.
CompART improves visual question answering (VQA) performance even though it is not explicitly trained for that task.
CompART consistently outperforms baseline models on multiple benchmarks.
Abstract
Vision-Language Models (VLMs) have achieved strong performance on implicit and explicit visual grounding and related tasks. However, such abilities are generally tested on simple, single-object phrases. We find that grounding performance degrades for complex, multi-object references. These limitations largely arise from training objectives that leverage image-caption alignment, where direct multi-object references are rare, the number of possible such references is theoretically large (exponential in the number of objects), and attribution is difficult. To address this, without requiring any additional annotations, we propose Compositional Attention-Regularized Training (CompART), which decomposes captions into object-centric phrases and constructs composite phrases by pairing them with conjunctions. We then introduce a composition loss that encourages the attention induced by a…
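The phrase-pairing step described in the abstract can be illustrated with a minimal sketch. It assumes the object-centric phrases have already been extracted from a caption; the function names `compose_phrases` and `count_multi_object_subsets`, the example phrases, and the choice of "and" as the conjunction are hypothetical and not taken from the paper. The counting helper shows one way to read the "exponential in the number of objects" remark, by counting subsets of two or more objects. The composition loss is not sketched here because its description is cut off in the abstract above.

```python
from itertools import combinations


def compose_phrases(phrases, conjunction="and"):
    """Join every unordered pair of object-centric phrases with a conjunction.

    Illustrative only: the paper's actual pairing strategy may differ.
    """
    return [f"{a} {conjunction} {b}" for a, b in combinations(phrases, 2)]


def count_multi_object_subsets(n):
    """Number of subsets of n objects that contain at least two objects."""
    # All 2**n subsets, minus the empty set and the n single-object subsets.
    return 2 ** n - n - 1


# Hypothetical phrases, assumed to come from an upstream caption decomposition step.
phrases = ["a brown dog", "a red ball", "a wooden bench"]
composites = compose_phrases(phrases)

for phrase in composites:
    print(phrase)
print(count_multi_object_subsets(len(phrases)))
```

Pairing alone yields only n(n-1)/2 composites, which keeps the training set small relative to the full space of multi-object references counted by the second helper.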
