A Better Loss for Visual-Textual Grounding

Davide Rigoni; Luciano Serafini; Alessandro Sperduti

arXiv:2108.05308·cs.CV·February 3, 2022

TL;DR

This paper introduces a novel loss function for visual-textual grounding that enhances bounding box accuracy and improves the balance between feature learning and bounding box prediction, outperforming existing models.

Contribution

Proposes a new loss function based on bounding box class probabilities that improves both bounding box selection and coordinate prediction in visual-textual grounding models.
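To make the two sub-tasks concrete, here is a minimal sketch of a generic grounding objective that combines a selection term over candidate boxes with a coordinate regression term. It is not the loss proposed in the paper, whose definition is not given on this page. The function names, the best-IoU target, the smooth L1 term, and the weight `lam` are all illustrative assumptions.

```python
import math

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def smooth_l1(pred, target):
    """Mean smooth L1 distance between two coordinate tuples."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total / len(pred)

def grounding_loss(scores, proposals, refined, gt_box, lam=1.0):
    """Hypothetical two-term loss (an assumption, not the paper's formulation).

    scores:    one matching score per proposal for the query phrase
    proposals: candidate boxes from a detector
    refined:   the model's refined coordinates for each proposal
    gt_box:    ground-truth box for the phrase
    lam:       weight trading off selection against regression
    """
    # Selection term: cross-entropy against the proposal with the highest IoU.
    ious = [iou(p, gt_box) for p in proposals]
    best = max(range(len(proposals)), key=lambda i: ious[i])
    selection = -math.log(softmax(scores)[best])
    # Regression term: how far the refined best box is from the ground truth.
    regression = smooth_l1(refined[best], gt_box)
    return selection + lam * regression, best
```

In a sketch like this, `lam` is the knob that shifts training effort between picking the right box and refining its coordinates, which is the balance the contribution statement refers to.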

Findings

01

Achieves higher accuracy than state-of-the-art models on benchmark datasets.

02

Enhances the balance between multi-modal feature learning and bounding box refinement.

03

Uses a simple multi-modal fusion component with an improved loss function.

Abstract

Given a textual phrase and an image, the visual grounding problem is the task of locating the content of the image referenced by the sentence. It is a challenging task that has several real-world applications in human-computer interaction, image-text reference resolution, and video-text reference resolution. In recent years, several works have addressed this problem by proposing increasingly large and complex models that try to capture visual-textual dependencies better than before. These models typically consist of two main components: one focuses on learning useful multi-modal features for grounding, and the other on improving the predicted bounding box of the visual mention. Finding the right learning balance between these two sub-tasks is not easy, and the current models are not necessarily optimal with respect to this issue. In this work, we propose a loss…

Code & Models

Repositories

drigoni/Loss_VT_Grounding (PyTorch, official)
