TL;DR
This paper introduces a simple baseline for visual grounding in autonomous driving. It scores image regions against a text embedding by cosine similarity, keeps design choices minimal, and outperforms previous methods on Talk2Car, challenging the need for complex models.
Contribution
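The authors propose a minimalistic approach: pre-trained networks supply the image ROI and text embeddings, a learned transformation layer is applied to the text embedding, and a cross-entropy loss over the cosine distance between them is minimized. It achieves state-of-the-art results with less complexity.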
Findings
Achieved 68.7% AP50 accuracy on the Talk2Car dataset
Outperformed the previous state of the art by 8.6%
Showed that simpler methods can be competitive with complex models
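To make the AP50 figure concrete, here is a minimal sketch of how such a score could be computed. It assumes the common convention that a predicted box counts as correct when its intersection-over-union (IoU) with the ground-truth box exceeds 0.5, with one predicted box per command. The function names `iou` and `ap50` and the `(x1, y1, x2, y2)` box format are illustrative assumptions, not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def ap50(predicted_boxes, ground_truth_boxes):
    """Fraction of predictions whose IoU with the ground truth exceeds 0.5.

    Assumes exactly one predicted box per ground-truth box, in the same order.
    """
    hits = sum(
        iou(pred, gt) > 0.5
        for pred, gt in zip(predicted_boxes, ground_truth_boxes)
    )
    return hits / len(ground_truth_boxes)
```

Under this convention, a prediction that overlaps the target only slightly scores zero for that example, so the metric rewards picking the right object rather than a nearby one.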
Abstract
In this paper, we present a simple baseline for visual grounding for autonomous driving which outperforms the state-of-the-art methods, while retaining minimal design choices. Our framework minimizes the cross-entropy loss over the cosine distance between multiple image ROI features and a text embedding (representing the given sentence/phrase). We use pre-trained networks for obtaining the initial embeddings and learn a transformation layer on top of the text embedding. We perform experiments on the Talk2Car dataset and achieve 68.7% AP50 accuracy, improving upon the previous state of the art by 8.6%. By showing promise in simpler alternatives, our investigation suggests reconsidering approaches that employ sophisticated attention mechanisms, multi-stage reasoning, or complex metric learning loss functions.
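The objective described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the transformation layer is modeled as a single linear map `W`, cosine similarities are used directly as logits with no temperature or scaling, and the names `cosine`, `project`, and `grounding_loss` are hypothetical. In practice the ROI features and text embedding would come from pre-trained networks; here they are plain lists of floats.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def project(W, x):
    """Apply a linear map W (list of rows) to vector x.

    Stands in for the learned transformation layer on the text embedding.
    """
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def grounding_loss(roi_feats, text_emb, W, target):
    """Cross-entropy over cosine similarities between ROIs and the text.

    Returns (loss, predicted ROI index, per-ROI similarity scores).
    """
    query = project(W, text_emb)
    scores = [cosine(roi, query) for roi in roi_feats]
    # Numerically stable log-sum-exp for the softmax normalizer.
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    loss = log_z - scores[target]
    predicted = max(range(len(scores)), key=scores.__getitem__)
    return loss, predicted, scores


# Toy example: three 2-D ROI features, an identity projection,
# and a text embedding that points along the second ROI.
rois = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_identity = [[1.0, 0.0], [0.0, 1.0]]
loss, predicted, scores = grounding_loss(rois, [0.0, 2.0], W_identity, target=1)
```

During training, only `W` would be updated to lower this loss, while at inference the ROI with the highest similarity is returned as the grounded object.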
