COFAR: Commonsense and Factual Reasoning in Image Search

Prajwal Gatti; Abhirama Subramanyam Penamakuri; Revant Teotia; Anand Mishra; Shubhashis Sengupta; Roshni Ramnani

arXiv:2210.08554 · cs.CV · October 18, 2022 · 1 citation

TL;DR

This paper introduces KRAMT, a unified multimodal framework that enhances image search by integrating commonsense and factual knowledge, enabling reasoning beyond visual recognition, and demonstrates its effectiveness on the new COFAR dataset.

Contribution

The paper presents KRAMT, a novel knowledge retrieval-augmented multimodal transformer that incorporates encyclopedic knowledge into image search, addressing limitations of visual-only recognition.

Findings

1. KRAMT outperforms existing methods on the COFAR dataset.
2. The framework effectively grounds knowledge in visual content.
3. Enhanced reasoning improves search accuracy for complex queries.

Abstract

One characteristic that makes humans superior to modern artificially intelligent models is the ability to interpret images beyond what is visually apparent. Consider the following two natural language search queries: (i) "a queue of customers patiently waiting to buy ice cream" and (ii) "a queue of tourists going to see a famous Mughal architecture in India." Interpreting these queries requires one to reason with (i) commonsense, such as interpreting people as customers or tourists and actions as waiting to buy or going to see; and (ii) factual or world knowledge associated with named visual entities, for example, whether the store in the image sells ice cream or whether the landmark in the image is an example of Mughal architecture located in India. Such reasoning goes beyond just visual recognition. To enable both commonsense and factual reasoning in image search, we present a unified framework,…

Code & Models

Repositories

vl2g/cofar (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Linear Layer · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Residual Connection · Dropout