COFAR: Commonsense and Factual Reasoning in Image Search
Prajwal Gatti, Abhirama Subramanyam Penamakuri, Revant Teotia, Anand Mishra, Shubhashis Sengupta, Roshni Ramnani

TL;DR
This paper introduces KRAMT, a unified multimodal framework that enhances image search by integrating commonsense and factual knowledge, enabling reasoning beyond visual recognition, and demonstrates its effectiveness on the new COFAR dataset.
Contribution
The paper presents KRAMT, a novel knowledge retrieval-augmented multimodal transformer that incorporates encyclopedic knowledge into image search, addressing limitations of visual-only recognition.
Findings
KRAMT outperforms existing methods on the COFAR dataset.
The framework effectively grounds knowledge in visual content.
Enhanced reasoning improves search accuracy for complex queries.
Abstract
One characteristic that makes humans superior to modern artificially intelligent models is the ability to interpret images beyond what is visually apparent. Consider the following two natural language search queries - (i) "a queue of customers patiently waiting to buy ice cream" and (ii) "a queue of tourists going to see a famous Mughal architecture in India." Interpreting these queries requires one to reason with (i) Commonsense such as interpreting people as customers or tourists, actions as waiting to buy or going to see; and (ii) Fact or world knowledge associated with named visual entities, for example, whether the store in the image sells ice cream or whether the landmark in the image is a Mughal architecture located in India. Such reasoning goes beyond just visual recognition. To enable both commonsense and factual reasoning in the image search, we present a unified framework,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Linear Layer · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Residual Connection · Dropout
