Cosine meets Softmax: A tough-to-beat baseline for visual grounding

Nivedita Rufus; Unni Krishnan R Nair; K. Madhava Krishna; Vineet Gandhi

arXiv:2009.06066·cs.CV·September 15, 2020

TL;DR

This paper introduces a simple yet effective baseline for visual grounding in autonomous driving, outperforming previous methods by leveraging cosine similarity and minimal design, challenging the need for complex models.

Contribution

The authors propose a minimalistic approach using cosine distance and pre-trained embeddings, achieving state-of-the-art results with less complexity.

Findings

01 Achieved 68.7% AP50 accuracy on the Talk2Car dataset

02 Outperformed the previous state of the art by 8.6%

03 Showed that simpler methods can be competitive with complex models
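For readers unfamiliar with the metric, here is a minimal sketch of how an AP50-style score can be computed. It assumes one predicted box per command, boxes given as (x1, y1, x2, y2), and that a prediction counts as correct when its intersection-over-union (IoU) with the ground-truth box exceeds 0.5. The function names `iou` and `ap50` are illustrative and are not taken from the paper or its repository.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if any).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def ap50(predictions, ground_truths, threshold=0.5):
    """Fraction of predictions whose IoU with the ground truth exceeds the threshold.

    Assumes exactly one predicted box per ground-truth box, in the same order.
    """
    hits = sum(iou(p, g) > threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)
```

Under these assumptions, a reported score of 68.7% would mean that roughly 687 out of every 1000 predicted boxes overlap their target by more than half in IoU terms.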

Abstract

In this paper, we present a simple baseline for visual grounding for autonomous driving which outperforms the state-of-the-art methods, while retaining minimal design choices. Our framework minimizes the cross-entropy loss over the cosine distance between multiple image ROI features with a text embedding (representing the given sentence/phrase). We use pre-trained networks for obtaining the initial embeddings and learn a transformation layer on top of the text embedding. We perform experiments on the Talk2Car dataset and achieve 68.7% AP50 accuracy, improving upon the previous state of the art by 8.6%. Our investigation suggests reconsideration towards more approaches employing sophisticated attention mechanisms or multi-stage reasoning or complex metric learning loss functions by showing promise in simpler alternatives.
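The objective described in the abstract can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the authors' implementation: it assumes the learned transformation is a single linear map (the hypothetical `weight` matrix), that the ROI features and text embedding are already extracted by pre-trained networks, and that the cosine similarities are fed to the softmax without any scaling. All function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def linear(weight, vec):
    """Apply a linear map (list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def cosine_softmax_loss(roi_feats, text_emb, weight, target_idx):
    """Cross-entropy over softmax-normalised cosine similarities.

    roi_feats:  one feature vector per candidate image region
    text_emb:   embedding of the command (from a pre-trained text encoder)
    weight:     hypothetical learned projection applied to the text embedding
    target_idx: index of the ground-truth region
    Returns (loss, index of the highest-scoring region).
    """
    projected = linear(weight, text_emb)
    sims = [cosine(feat, projected) for feat in roi_feats]

    # Numerically stable log-sum-exp for the softmax normaliser.
    peak = max(sims)
    log_z = peak + math.log(sum(math.exp(s - peak) for s in sims))

    loss = log_z - sims[target_idx]  # equals -log softmax(sims)[target_idx]
    pred = max(range(len(sims)), key=sims.__getitem__)
    return loss, pred
```

At inference time, under the same assumptions, the region with the highest cosine similarity to the projected text embedding would be returned as the grounded object.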

Code & Models

Repositories

niveditarufus/CMSVG (PyTorch, official)