Latent Alignment of Procedural Concepts in Multimodal Recipes

Hossein Rajaby Faghihi, Roshanak Mirzaee, Sudarshan Paliwal, and Parisa Kordjamshidi

arXiv:2101.04727 · cs.CL · January 14, 2021

TL;DR

This paper introduces an alignment mechanism that combines attention networks, cross-modal representations, and a latent alignment space between instructions and candidate answers, and reports a 19% improvement over the baselines on the RecipeQA textual cloze task.

Contribution

It presents a new alignment approach with constrained max-pooling for better cross-modal reasoning in multimodal QA tasks.

Findings

01  A 19% improvement over the baselines on the RecipeQA textual cloze task

02  Effective alignment of instructions and images in recipes

03  Enhanced reading comprehension on multimodal data

Abstract

We propose a novel alignment mechanism for procedural reasoning on RecipeQA, a newly released multimodal QA dataset. Our model solves the textual cloze task, a reading comprehension task over recipes that contain images and instructions. We exploit the power of attention networks, cross-modal representations, and a latent alignment space between instructions and candidate answers to solve the problem. We introduce constrained max-pooling, which refines the max-pooling operation on the alignment matrix to impose disjoint constraints among the outputs of the model. Our evaluation results indicate a 19% improvement over the baselines.
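The abstract does not spell out how constrained max-pooling is computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the alignment matrix holds one score per (slot, candidate) pair and that the "disjoint constraint" means no candidate may be selected for more than one slot. The function names `plain_max_pool` and `constrained_max_pool` and the example scores are hypothetical.

```python
from typing import List


def plain_max_pool(scores: List[List[float]]) -> List[int]:
    """Unconstrained baseline: each slot independently picks its best candidate."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


def constrained_max_pool(scores: List[List[float]]) -> List[int]:
    """Greedy one-to-one selection over an alignment matrix (assumed reading).

    scores[i][j] is the alignment score between slot i and candidate j.
    Repeatedly takes the highest remaining score, then removes that slot and
    that candidate from further consideration, so no candidate is used twice.
    Returns the chosen candidate index per slot, or -1 if none was left.
    """
    n_rows = len(scores)
    n_cols = len(scores[0]) if scores else 0
    assignment = [-1] * n_rows
    free_rows = list(range(n_rows))
    free_cols = list(range(n_cols))
    while free_rows and free_cols:
        # Highest score among the slots and candidates that are still free.
        i, j = max(
            ((r, c) for r in free_rows for c in free_cols),
            key=lambda rc: scores[rc[0]][rc[1]],
        )
        assignment[i] = j
        free_rows.remove(i)
        free_cols.remove(j)
    return assignment


# Hypothetical alignment scores: 3 slots (rows) x 3 candidates (columns).
scores = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.30],
    [0.10, 0.70, 0.60],
]

print(plain_max_pool(scores))        # candidate 0 is chosen by two slots
print(constrained_max_pool(scores))  # every candidate is chosen at most once
```

In this toy matrix, plain max-pooling lets the first two slots both select candidate 0, while the constrained variant forces the second slot onto a different candidate. A greedy pass is the simplest way to enforce disjointness; it does not guarantee the assignment with the highest total score.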

Code & Models

Repositories

HLR/LatentAlignmentProcedural
PyTorch · Official
