TL;DR
This paper introduces an alignment mechanism that combines attention networks with a latent alignment space to improve multimodal recipe question answering, reporting a 19% improvement over baselines on the RecipeQA dataset.
Contribution
It presents a new alignment approach in which constrained max-pooling over the alignment matrix imposes disjoint constraints on the model's outputs, for better cross-modal reasoning in multimodal QA tasks.
Findings
19% performance improvement over baselines
Effective alignment of instructions and images in recipes
Enhanced reading comprehension on multimodal data
Abstract
We propose a novel alignment mechanism for procedural reasoning on RecipeQA, a newly released multimodal QA dataset. Our model solves the textual cloze task, a reading comprehension task over a recipe containing images and instructions. We exploit the power of attention networks, cross-modal representations, and a latent alignment space between instructions and candidate answers to solve the problem. We introduce constrained max-pooling, which refines the max-pooling operation on the alignment matrix to impose disjoint constraints among the outputs of the model. Our evaluation indicates a 19% improvement over the baselines.
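The abstract does not spell out how constrained max-pooling works, so the following is only a minimal sketch of one possible reading: rows of the alignment matrix are pooled one at a time, and a column already selected by an earlier row is masked out, so no two outputs pick the same column. The function name `constrained_max_pool`, the greedy row order, and the example scores are assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple


def constrained_max_pool(scores: List[List[float]]) -> Tuple[List[int], List[float]]:
    """Greedy row-wise max-pooling where each column is selected at most once.

    Hypothetical illustration of a "disjoint" constraint on an alignment
    matrix; the paper's actual formulation may differ.
    """
    if scores and len(scores) > len(scores[0]):
        raise ValueError("need at least as many columns as rows")

    used = set()   # columns already claimed by earlier rows
    picks = []     # selected column index per row
    values = []    # score at the selected column per row
    for row in scores:
        # Plain max-pooling would take max(row); here claimed columns are skipped.
        best = max(
            (j for j in range(len(row)) if j not in used),
            key=lambda j: row[j],
        )
        used.add(best)
        picks.append(best)
        values.append(row[best])
    return picks, values


# Made-up alignment scores (rows and columns are illustrative only).
example = [
    [0.9, 0.1, 0.3],
    [0.8, 0.2, 0.7],
    [0.6, 0.5, 0.4],
]
picks, values = constrained_max_pool(example)
print(picks, values)
```

In this toy matrix, unconstrained max-pooling would select column 0 for every row, whereas the constrained version forces each row onto a different column. A greedy pass is order-dependent; an optimal assignment (for example, the Hungarian algorithm) would be another way to satisfy the same constraint.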
