TL;DR
This paper introduces a novel loss function for visual-textual grounding that enhances bounding box accuracy and improves the balance between feature learning and bounding box prediction, outperforming existing models.
Contribution
Proposes a new loss function based on bounding box class probabilities that improves both bounding box selection and coordinate prediction in visual-textual grounding models.
Findings
Achieves higher accuracy than state-of-the-art models on benchmark datasets.
Enhances the balance between multi-modal feature learning and bounding box refinement.
Uses a simple multi-modal fusion component together with the improved loss function.
Abstract
Given a textual phrase and an image, visual grounding is the task of locating the image content referenced by the phrase. It is a challenging task with several real-world applications in human-computer interaction, image-text reference resolution, and video-text reference resolution. In recent years, several works have addressed this problem by proposing increasingly large and complex models that try to capture visual-textual dependencies better than before. These models typically consist of two main components: one focuses on learning useful multi-modal features for grounding, and the other on improving the predicted bounding box of the visual mention. Finding the right learning balance between these two sub-tasks is not easy, and current models are not necessarily optimal in this respect. In this work, we propose a loss…
