Combining Multiple Cues for Visual Madlibs Question Answering

Tatiana Tommasi; Arun Mallya; Bryan Plummer; Svetlana Lazebnik; Alexander C. Berg; Tamara L. Berg

arXiv:1611.00393·cs.CV·February 9, 2018

Open Access

TL;DR

This paper introduces a multi-cue approach using specialized networks and spatial localization to improve visual question answering on the Visual Madlibs dataset, significantly outperforming previous methods.

Contribution

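The paper combines networks trained for specialized tasks (scene recognition, person activity classification, attribute prediction), spatial localization of phrases from candidate answers, and a joint nCCA embedding of image features and answers to improve fill-in-the-blank question answering on Visual Madlibs.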

Findings

01

Significant performance improvement over previous state-of-the-art.

02

Using diverse specialized cues enhances answer accuracy.

03

Spatial support for feature extraction is crucial for success.

Abstract

This paper presents an approach for answering fill-in-the-blank multiple choice questions from the Visual Madlibs dataset. Instead of generic and commonly used representations trained on the ImageNet classification task, our approach employs a combination of networks trained for specialized tasks such as scene recognition, person activity classification, and attribute prediction. We also present a method for localizing phrases from candidate answers in order to provide spatial support for feature extraction. We map each of these features, together with candidate answers, to a joint embedding space through normalized canonical correlation analysis (nCCA). Finally, we solve an optimization problem to learn to combine scores from nCCA models trained on multiple cues to select the best answer. Extensive experimental results show a significant improvement over the previous state of the art…
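
As a rough illustration of the final step described in the abstract, the sketch below scores each candidate answer against the image in several per-cue embedding spaces and picks the answer with the highest weighted sum. It is a minimal sketch under stated assumptions, not the authors' implementation: the embeddings are toy vectors standing in for already-projected nCCA outputs, cosine similarity is assumed as the per-cue score, and the weights are fixed by hand rather than learned through the optimization the paper describes. The names `cosine_scores`, `select_answer`, and the cue keys are hypothetical.

```python
import numpy as np


def cosine_scores(image_vec, answer_vecs):
    """Cosine similarity between one image embedding and each candidate answer."""
    image_unit = image_vec / np.linalg.norm(image_vec)
    answer_units = answer_vecs / np.linalg.norm(answer_vecs, axis=1, keepdims=True)
    return answer_units @ image_unit


def select_answer(image_embs, answer_embs, weights):
    """Combine per-cue scores with a weighted sum and return the best candidate.

    image_embs:  dict cue -> (d,) image embedding in that cue's joint space
    answer_embs: dict cue -> (n_candidates, d) answer embeddings in the same space
    weights:     dict cue -> float (hand-set here; an assumption, not learned)
    """
    n_candidates = next(iter(answer_embs.values())).shape[0]
    combined = np.zeros(n_candidates)
    for cue, weight in weights.items():
        combined += weight * cosine_scores(image_embs[cue], answer_embs[cue])
    return int(np.argmax(combined)), combined


# Toy data: two cues, three candidate answers, 2-D embeddings (all made up).
image_embs = {
    "scene": np.array([1.0, 0.0]),
    "action": np.array([0.0, 1.0]),
}
answer_embs = {
    "scene": np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
    "action": np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]),
}
weights = {"scene": 0.5, "action": 0.5}

best, scores = select_answer(image_embs, answer_embs, weights)
print(best, scores)
```

In this toy setup the first candidate aligns with the image in both cue spaces, so it receives the highest combined score. Replacing the fixed weights with learned ones would be the natural next step toward the combination the abstract describes.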

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Text and Document Classification Technologies